Abstract

Decision theory formally solves the problem of rational agents in uncertain worlds
if the true environmental prior probability distribution is known. Solomonoff's theory
of universal induction formally solves the problem of sequence prediction for an
unknown prior distribution. We combine both ideas and get a parameterless theory
of universal Artificial Intelligence. We give strong arguments that the resulting AIξ
model is the most intelligent unbiased agent possible. We outline for a number of
problem classes, including sequence prediction, strategic games, function minimization,
reinforcement and supervised learning, how the AIξ model can formally solve
them. The major drawback of the AIξ model is that it is uncomputable. To overcome
this problem, we construct a modified algorithm AIξ^{tl}, which is still effectively
more intelligent than any other time t and space l bounded agent. The computation
time of AIξ^{tl} is of the order $t \cdot 2^l$. Other discussed topics are formal definitions of
intelligence order relations, the horizon problem and relations of the AIξ theory to
other AI approaches.
1 Introduction



Artificial Intelligence: The science of Artificial Intelligence (AI) might be defined
as the construction of intelligent systems and their analysis. A natural definition of
systems is anything which has an input and an output stream. Intelligence is more
complicated. It can have many faces like creativity, solving problems, pattern recognition,
classification, learning, induction, deduction, building analogies, optimization, surviving
in an environment, language processing, knowledge and many more. A formal definition
incorporating every aspect of intelligence, however, seems difficult. Further, intelligence
is graded: there is a smooth transition between systems which everyone would agree
to be not intelligent and truly intelligent systems. One simply has to look in nature,
starting with, for instance, inanimate crystals, then come amino acids, then some RNA
fragments, then viruses, bacteria, plants, animals, apes, followed by the truly intelligent
homo sapiens, and possibly continued by AI systems or ETs. So the best we can expect
to find is a partial or total order relation on the set of systems, which orders them w.r.t.
their degree of intelligence (like intelligence tests do for human systems, but for a limited
class of problems). Having this order we are, of course, interested in large elements,
i.e. highly intelligent systems. If a largest element exists, it would correspond to the most
intelligent system which could exist.

Most, if not all, known facets of intelligence can be formulated as goal-driven or, more
precisely, as maximizing some utility function. It is, therefore, sufficient to study
goal-driven AI. E.g. the (biological) goal of animals and humans is to survive and spread.
The goal of AI systems should be to be useful to humans. The problem is that, except
for special cases, we know neither the utility function nor the environment in which the
system will operate in advance.



Main idea: We propose a theory which formally¹ solves the problem of unknown goal
and environment. It might be viewed as a unification of the ideas of universal induction,
probabilistic planning and reinforcement learning, or as a unification of sequential decision
theory with algorithmic information theory. We apply this model to some of the facets
of intelligence, including induction, game playing, optimization, reinforcement and
supervised learning, and show how it solves these problem classes. This, together with general
convergence theorems, motivates us to believe that the constructed universal AI system
is the best one in a sense to be clarified in the sequel, i.e. that it is the most intelligent
environment-independent system possible. The intention of this work is to introduce
the universal AI model and give an in-breadth analysis. Most arguments and proofs are
succinct and require slow reading or some additional pencil work.



Contents: Section 2: The general framework for AI might be viewed as the design and
study of intelligent agents [31]. An agent is a cybernetic system with some internal state,
which acts with output $y_k$ on some environment in cycle k, perceives some input $x_k$ from
the environment and updates its internal state. Then the next cycle follows. It operates
according to some function p. We split the input $x_k$ into a regular part $x'_k$ and a credit
$c_k$, often called reinforcement feedback. From time to time the environment provides
non-zero credit to the system. The task of the system is to maximize its utility, defined as
the sum of future credits. A probabilistic environment is a probability distribution μ(q)
over deterministic environments q. Most, if not all, environments are of this type. We
give a formal expression for the function p*, which maximizes in every cycle the total μ
expected future credit. This model is called the AIμ model. As every AI problem can be
brought into this form, the problem of maximizing utility is hence formally solved
if μ is known. There is nothing remarkable or new here; it is the essence of sequential
decision theory [...]. Notation and formulas needed in later sections are simply
developed. There are two major remaining problems. The problem of the unknown true
prior probability μ is solved in section 4. Computational aspects are addressed in section 10.

¹ With a formal solution we mean a rigorous mathematical definition, uniquely specifying the solution.
In the following, a solution is always meant in this formal sense.


Section 3: Instead of talking about probability distributions μ(q) over functions, one could
describe the environment by the conditional probability of providing inputs $x_1 \ldots x_n$ to the
system under the condition that the system outputs $y_1 \ldots y_n$. The definition of the optimal
p* system in this iterative form is shown to be equivalent to the previous functional form.
The functional form is more elegant and will be used to define an intelligence order relation
and the time-bounded model in section 10. The iterative form is more index intensive but
more suitable for explicit calculations and is used in most of the other sections. Further,
we introduce factorizable probability distributions.

Section 4: A special topic is the theory of induction. In which sense prediction of the
future is possible at all is best summarized by the theory of Solomonoff. Given the initial
binary sequence $x_1 \ldots x_k$, what is the probability of the next bit being 1? It can be fairly well
predicted by using a universal probability distribution ξ, invented and shown to converge to
the true prior probability μ by Solomonoff [35, 36], as long as μ (which need not be known!)
is computable. The problem of unknown μ is hence solved for induction problems. All AI
problems where the system's output does not influence the environment, i.e. all passive
systems, are of this inductive form. Besides sequence prediction (SP), classification (CF)
is also of this type. Active systems, like game playing (SG) and optimization (FM), cannot
be reduced to induction systems. The main idea of this work is to generalize
universal induction to the general cybernetic model described in sections 2 and 3. For
this, we generalize ξ to include conditions and replace μ by ξ in the rational agent model.
In this way the problem that the true prior probability μ is usually unknown is solved.
Universality of ξ and convergence of ξ → μ will be shown. These are strong arguments
for the optimality of the resulting AIξ model. There are certain difficulties in proving
rigorously that and in which sense it is optimal, i.e. the most intelligent system. Further,
we introduce a universal order relation for intelligence.

Sections 5-8 show how a number of AI problem classes fit into the general AIξ model. All
these problems are formally solved by the AIξ model. The solution is, however, only formal
because the AIξ model developed thus far is uncomputable or, at best, approximable.
These sections should support the claim that every AI problem can be formulated (and
hence solved) within the AIξ model. For some classes we give concrete examples to
illuminate the scope of the problem class. We first formulate each problem class in its
natural way (when $\mu^{\text{problem}}$ is known) and then construct a formulation within the AIμ
model and prove its equivalence. We then consider the consequences of replacing μ by ξ.
The main goal is to understand why and how the problems are solved by AIξ. We only
highlight special aspects of each problem class. Sections 5-8 together should give a better
picture of the AIξ model. We do not study every aspect for every problem class. The
sections might be read selectively. They are not necessary to understand the remaining
sections.

Section 5: Using the AIμ model for sequence prediction (SP) is identical to Bayesian
sequence prediction SPΘ_μ. One might expect that, when using the AIξ model for sequence
prediction, one would recover exactly the universal sequence prediction scheme SPΘ_ξ, as
AIξ was a unification of the AIμ model and the idea of universal probability ξ.
Unfortunately this is not the case. One reason is that ξ is only a probability distribution in
the inputs x and not in the outputs y. This is also one of the origins of the difficulty
of proving error/credit bounds for AIξ. Nevertheless, we argue that AIξ is equally well
suited for sequence prediction as SPΘ_ξ is. In a very limited setting we prove a (weak)
error bound for AIξ which gives hope that a general proof is attainable.

Section 6: A very important class of problems are strategic games (SG). We restrict
ourselves to deterministic strictly competitive strategic games like chess. If the environment
is a minimax player, the AIμ model itself reduces to a minimax strategy. Repeated games
of fixed lengths are a special case for factorizable μ. The consequences of variable game
length are sketched. The AIξ model has to learn the rules of the game under consideration,
as it has no prior information about these rules. We describe how AIξ actually learns
these rules.

Section 7: There are many problems that fall into the category 'resource bounded function
minimization' (FM). They include the Traveling Salesman Problem, minimizing production
costs, inventing new materials or even producing, e.g., nice paintings, which are
(subjectively) judged by a human. The task is to (approximately) minimize some function
$f: Y \to Z$ within a minimal number of function calls. We will see that a greedy model
trying to minimize f in every cycle fails. Although the greedy model has nothing to
do with downhill or gradient techniques (there is nothing like a gradient or direction for
functions over Y), which are known to fail, we discover the same difficulties. FM has
already nearly the full complexity of general AI. The reason is that FM can actively
influence the information gathering process by its trials $y_k$ (whereas SP and CF cannot).
We discuss in detail the optimal FMμ model and its inventiveness in choosing the $y \in Y$.
A discussion of the subtleties when using AIξ for function minimization follows.

Section 8: Reinforcement learning, as the AIξ model does, is an important learning
technique but not the only one. To improve the speed of learning, supervised learning, i.e.
learning by acquiring knowledge, or learning from a constructive teacher, is necessary. We
show how AIξ learns to learn supervised. It actually establishes supervised learning very
quickly, within O(1) cycles.

Section 9 gives a brief survey of other general aspects, ideas and methods in AI, and their
connection to the AIξ model. Some aspects are directly included, others are or should be
emergent.
Section 10: Up to now we have shown the universal character of the AIξ model but have
completely ignored computational aspects. Let us assume that there exists some algorithm
p of size l with computation time per cycle t, which behaves in a sufficiently intelligent
way (this assumption is the very basis of AI). The algorithm p* should run all algorithms
of length ≤ l for t time steps in every cycle and select the best output among them. So we
have an algorithm which runs in time $t \cdot 2^l$ and is at least as good as p, i.e. it also serves our
needs apart from the (very large but) constant multiplicative factor in computation time.
This idea of the 'typing monkeys', one of them eventually producing 'Shakespeare', is well
known and widely used in theoretical computer science. The difficult part is the selection
of the algorithm with the best output. A further complication is that the selection process
itself must have only limited computation time. We present a suitable modification of
the AIξ model which solves these difficult problems. The solution is somewhat involved
from an implementational aspect. An implementation would include first order logic, the
definition of a universal Turing machine within it and proof theory. The assumptions
behind this construction are discussed at the end.



Section 11 contains some discussion of otherwise unmentioned topics and some (personal)
remarks. It also serves as an outlook to further research.

Section 12 contains the conclusions.



History & References: Kolmogorov65 suggested to define the information content
of an object as the length of the shortest program computing a representation of it.
Solomonoff64 invented the closely related universal prior probability distribution and
used it for binary sequence prediction [35, ...] and function inversion and minimization
[37]. Together with Chaitin66&75 this was the invention of what is now called
Algorithmic Information theory. For further literature and many applications see [...].
Other interesting 'applications' can be found in [...]. Related topics are the Weighted
Majority Algorithm invented by Littlestone and Warmuth89 [20], universal forecasting by
Vovk92, Levin search73, pac-learning introduced by Valiant84 and Minimum
Description Length [...]. Resource bounded complexity is discussed in [..., 16, ...],
resource bounded universal probability in [...]. Implementations are rare [...].
Excellent reviews with a philosophical touch are [...]. For an older, but general review
of inductive inference see Angluin83. For an excellent introduction into algorithmic
information theory, further literature and many applications one should consult the book
of Li and Vitányi97. The survey [...] or the chapters 4 and 5 of [...] should be
sufficient to follow the arguments and proofs in this paper. The other ingredient in our
AIξ model is sequential decision theory. We do not need much more than the maximum
expected utility principle and the expectimax algorithm [25, 31]. The book of von Neumann
and Morgenstern44 might be seen as the initiation of game theory, which already
contains the expectimax algorithm as a special case. The literature on decision theory
is vast and we only give two possibly interesting references with regard to this paper.
Cheeseman85&88 is a defense of the use of probability theory in AI. Pearl88 is a
good introduction and overview of probabilistic reasoning.
2 The AIμ Model in Functional Form

The cybernetic or agent model: A good way to start thinking about intelligent
systems is to consider more generally cybernetic systems, in AI usually called agents.
This avoids having to struggle with the meaning of intelligence from the very beginning.
A cybernetic system is a control circuit with input x and output y and an internal state.
From an external input and the internal state the system calculates deterministically or
stochastically an output. This output (action) modifies the environment and leads to a
new input (reception). This continues ad infinitum or for a finite number of cycles. As
explained in the last section, we need some credit assignment to the cybernetic system.
The input x is divided into two parts, the standard input x' and some credit input c.
If input and output are represented by strings, a deterministic cybernetic system can be
modeled by a Turing machine p. p is called the policy of the agent, which determines
the action in response to a reception. If the environment is also computable it might be
modeled by a Turing machine q as well. The interaction of the agent with the environment
can be illustrated as follows:




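[Figure: the system p and the environment q, each with its own working tape, coupled by one tape carrying the inputs x_1, x_2, ... and one tape carrying the outputs y_1, y_2, ..., y_6, ...]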





p as well as q have unidirectional input and output tapes and bidirectional working tapes.
What entangles the agent with the environment is the fact that the upper tape serves as
input tape for p as well as output tape for q, and that the lower tape serves as output
tape for p as well as input tape for q. Further, the reading head must always be left of the
writing head, i.e. the symbols must first be written before they are read. p and q have
their own mutually inaccessible working tapes containing their own 'secrets'. The heads
move in the following way. In the $k$-th cycle p writes $y_k$, q reads $y_k$, q writes $x_k \equiv c_k x'_k$,
p reads $x_k \equiv c_k x'_k$, followed by the $(k+1)$-th cycle and so on. The whole process starts
with the first cycle, all heads at the start of their tapes and the working tapes being empty.
We call Turing machines behaving in this way chronological Turing machines, for obvious
reasons. Before continuing, some notations on strings are appropriate.
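
The cycle structure just described can be mimicked in a few lines of Python. This is only a minimal sketch under simplifying assumptions: p and q are ordinary functions rather than Turing machines, p returns only the new word $y_k$ instead of the whole output $y_{1:k}$, and the names `interact`, `echo_policy` and `parity_env` are hypothetical illustrations that do not appear in the text.

```python
from typing import Callable, List, Tuple

# An input word x_k is modelled as a pair (credit c_k, regular part x'_k).
Percept = Tuple[int, str]
Policy = Callable[[List[Percept]], str]        # p: x_{<k}  -> y_k
Environment = Callable[[List[str]], Percept]   # q: y_{1:k} -> x_k


def interact(p: Policy, q: Environment, cycles: int) -> List[Tuple[str, Percept]]:
    """Run p against q chronologically and return y_1 x_1 ... y_n x_n."""
    xs: List[Percept] = []
    ys: List[str] = []
    for _ in range(cycles):
        y = p(list(xs))      # p writes y_k having read only x_{<k}
        ys.append(y)
        x = q(list(ys))      # q reads y_k, then writes x_k = c_k x'_k
        xs.append(x)
    return list(zip(ys, xs))


def echo_policy(xs: List[Percept]) -> str:
    """Hypothetical policy: repeat the last regular input, start with '0'."""
    return xs[-1][1] if xs else "0"


def parity_env(ys: List[str]) -> Percept:
    """Hypothetical environment: credit 1 iff the number of '1' outputs is even;
    the regular part alternates '1', '0', '1', ... with the cycle number."""
    ones = sum(1 for y in ys if y == "1")
    credit = 1 if ones % 2 == 0 else 0
    return (credit, "1" if len(ys) % 2 == 1 else "0")


history = interact(echo_policy, parity_env, 3)
```

Passing copies of the tapes to p and q mirrors the requirement that each machine only reads what has already been written and cannot see the other's working tape.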



Strings: We will denote strings over the alphabet X by $s = x_1 x_2 \ldots x_n$, with $x_k \in X$,
where X is alternatively interpreted as a non-empty subset of ℕ or itself as a prefix-free
set of binary strings. $l(s) = l(x_1) + \ldots + l(x_n)$ is the length of s. Analogous definitions hold
for $y_k \in Y$. We call $x_k$ the $k$-th input word and $y_k$ the $k$-th output word (rather than letter).
The string $s = y_1 x_1 \ldots y_n x_n$ represents the input/output in chronological order. Due to the
prefix property of the $x_k$ and $y_k$, s can be uniquely separated into its words. The words
appearing in strings are always in chronological order. We further introduce the following
abbreviations: ε is the empty string, $x_{n:m} := x_n x_{n+1} \ldots x_{m-1} x_m$ for $n \le m$ and ε for $n > m$.
$x_{<n} := x_1 \ldots x_{n-1}$. Analogously for y. Further, $yx_n := y_n x_n$, $yx_{n:m} := y_n x_n \ldots y_m x_m$, and so on.

AI model for known deterministic environment: Let us define for the chronological
Turing machine p a partial function also named $p: X^* \to Y^*$ with $y_{1:k} = p(x_{<k})$, where
$y_{1:k}$ is the output of Turing machine p on input $x_{<k}$ in cycle k, i.e. where p has read up to
$x_{k-1}$ but no further. In an analogous way, we define $q: Y^* \to X^*$ with $x_{1:k} = q(y_{1:k})$.
Conversely, for every partial recursive chronological function we can define a corresponding
chronological Turing machine. Each (system, environment) pair (p, q) produces a unique
I/O sequence $\omega(p,q) := y_1^{pq} x_1^{pq} y_2^{pq} x_2^{pq} \ldots$. When we look at the definition of p and q we see a
nice symmetry between the cybernetic system and the environment. Until now, not much
intelligence is in our system. Now the credit assignment comes into the game and removes
the symmetry somewhat. We split the input $x_k \in X := C \times X'$ into a regular part $x'_k \in X'$
and a credit $c_k \in C \subset \mathbb{R}$. We define $x_k \equiv c_k x'_k$ and $c_k \equiv c(x_k)$. The goal of the system
should be to maximize received credits. This is called reinforcement learning. The reason
for the asymmetry is that eventually we (humans) will be the environment with which
the system will communicate, and we want to dictate what is good and what is wrong,
not the other way round. This one-way learning (the system learns from the environment,
and not conversely) neither prevents the system from becoming more intelligent than the
environment, nor does it prevent the environment from learning from the system, because the
environment can itself interpret the outputs $y_k$ as a regular and a credit part. The
environment is just not forced to learn, whereas the system is. In cases where we restrict the
credit to two values $c \in C = \mathbb{B} := \{0, 1\}$, c = 1 is interpreted as a positive feedback, called
good or correct, and c = 0 as a negative feedback, called bad or error in the following. Further,
let us restrict for a while the lifetime (number of cycles) T of the system to a large, but
finite value. Let $C_{km}(p,q) := \sum_{i=k}^{m} c(x_i)$ be the total credit the system p receives from the
environment q in the cycles k to m. It is now natural to call the system which maximizes
the total credit $C_{1T}$, called utility, the best or most intelligent one².

$$p^{*,T,q} := \mathrm{maxarg}_p\, C_{1T}(p,q) \quad\Rightarrow\quad C_{kT}(p^{*,T,q},q) \ge C_{kT}(p,q) \quad \forall p$$

For k = 1 this is obvious, and for k > 1 it is easy to see when p ranges over the systems whose
outputs agree with those of $p^{*,T,q}$ in the cycles before k. If T, Y and X are finite, the number of
different behaviours of the system, i.e. the search space, is finite. Therefore, because we
have assumed that q is known, $p^{*,T,q}$ can effectively be determined (by pre-analyzing all
behaviours). The main reason for restricting to finite T was not to ensure computability of
$p^{*,T,q}$ but that the limit $T \to \infty$ might not exist. This is nothing special; the (unrealistic)
assumption of a completely known deterministic environment q has simply trivialized
everything.
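
To make 'pre-analyzing all behaviours' concrete, the following Python sketch enumerates the finite search space for a known deterministic q. It rests on an observation that is an assumption of this illustration rather than a statement from the text: against a fixed deterministic q, the credit collected by a policy depends only on the output sequence $y_{1:T}$ it produces, so it suffices to enumerate output sequences. The environment `delayed_env` and the function names are hypothetical.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Env = Callable[[Tuple[str, ...]], int]  # q restricted to credits: y_{1:k} -> c_k


def total_credit(ys: Sequence[str], q: Env) -> int:
    """C_{1T}: sum of the credits c_k that q returns along the outputs ys."""
    return sum(q(tuple(ys[:k])) for k in range(1, len(ys) + 1))


def best_outputs(q: Env, Y: Sequence[str], T: int) -> Tuple[Tuple[str, ...], int]:
    """Enumerate all |Y|^T output sequences and keep the best one.

    Sequences are visited in lexicographic order and only strict improvements
    replace the incumbent, so ties resolve to the lexicographically smallest
    maximizer.
    """
    best: Optional[Tuple[str, ...]] = None
    best_credit: Optional[int] = None
    for ys in product(sorted(Y), repeat=T):
        credit = total_credit(ys, q)
        if best_credit is None or credit > best_credit:
            best, best_credit = ys, credit
    return best, best_credit


def delayed_env(ys: Tuple[str, ...]) -> int:
    """Hypothetical q: 'a' pays 1 at once; 'b' pays nothing now but 3 one cycle later."""
    credit = 1 if ys[-1] == "a" else 0
    if len(ys) >= 2 and ys[-2] == "b":
        credit += 3
    return credit


plan, utility = best_outputs(delayed_env, ["a", "b"], 3)
```

The cost is $|Y|^T$ evaluations of q, which is finite for finite T and Y but grows exponentially in the lifetime.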

² $\mathrm{maxarg}_p\, C(p)$ is the p which maximizes C(·). If there is more than one maximum we might choose
the lexicographically smallest one for definiteness.






AI model for known prior probability: Let us now weaken our assumptions by
replacing the environment q with a probability distribution μ(q) over chronological
functions. μ might be interpreted in two ways. Either the environment itself behaves in a
probabilistic way defined by μ, or the true environment is deterministic, but we only have
probabilistic information about which environment is the true one. Combinations
of both cases are also possible. The interpretation does not matter in the following.
We just assume that we know μ but no more about the environment, whatever the
interpretation may be.

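Let us assume we are in cycle k with history $\dot y\dot x_1 \ldots \dot y\dot x_{k-1}$ and ask for the best output $y_k$.
Further, let $\dot Q_k := \{q : q(\dot y_{<k}) = \dot x_{<k}\}$ be the set of all environments producing the above
history. The expected credit for the next m−k+1 cycles (given the above history) is given
by a conditional probability:

$$C^\mu_{km}(p|\dot y\dot x_{<k}) := \frac{\sum_{q\in\dot Q_k} \mu(q)\, C_{km}(p,q)}{\sum_{q\in\dot Q_k} \mu(q)} \qquad (1)$$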
We cannot simply determine $\mathrm{maxarg}_p(C_{1T})$ unlike the deterministic case because the
history is no longer deterministically determined by p and q, but depends on p and μ
and on the outcome of a stochastic process. Every new cycle adds new information ($\dot x_i$)
to the system. This is indicated by the dots over the symbols. In cycle k we have to
maximize the expected future credit, taking into account the information in the history
$\dot y\dot x_{<k}$. This information is not already present in p and q/μ at the system's start, unlike in
the deterministic case.

Further, we want to generalize the finite lifetime T to a dynamical (computable)
farsightedness $h_k \equiv m_k - k + 1 \ge 1$, called horizon in the following. For $m_k = T$ we have our original
finite lifetime; for $m_k = k + m - 1$ the system maximizes in every cycle the next m expected
credits. A discussion of the choices for $m_k$ is deferred to a later section.

The next $h_k$ credits are maximized by

$$p^*_k := \mathrm{maxarg}_{p\in\dot P_k}\, C^\mu_{k m_k}(p|\dot y\dot x_{<k}),$$

where $\dot P_k := \{p : p(\dot x_{<k}) = \dot y_{<k}*\}$ is the set of systems consistent with the current history.
$p^*_k$ depends on k and is used only in step k to determine $\dot y_k$ by $p^*_k(\dot x_{<k}|\dot y_{<k}) = \dot y_{<k}\dot y_k$. After
writing $\dot y_k$ the environment replies with $\dot x_k$ with (conditional) probability $\mu(\dot Q_{k+1})/\mu(\dot Q_k)$.
This probabilistic outcome provides new information to the system. The cycle k+1 starts
with determining $\dot y_{k+1}$ from $p^*_{k+1}$ (which differs from $p^*_k$ as $\dot x_k$ is now fixed) and so on.
Note that $p^*_k$ depends also on $\dot y_{<k}$ because $\dot P_k$ and $\dot Q_k$ do so. But recursively inserting $p^*_{k-1}$
and so on, we can define

$$p^*(\dot x_{<k}) := p^*_k(\dot x_{<k}\,|\,p^*_{k-1}(\dot x_{<k-1}\,|\ldots p^*_1)) \qquad (2)$$

It is a chronological function and computable if X, Y and $m_k$ are finite. The policy p*
defines our AIμ model. For deterministic³ μ this model reduces to the deterministic case.

³ We call a probability distribution deterministic if it is 1 for exactly one argument and 0 for all others.
It is important to maximize the sum of future credits and not, for instance, to be greedy
and only maximize the next credit, as is done e.g. in sequence prediction. For example,
let the environment be a sequence of chess games where each cycle corresponds to one move.
Only at the end of each game is a positive credit c = 1 given to the system if it won the
game (and made no illegal move). For the system, maximizing all future credits means
trying to win as many games as possible in as short a time as possible (and avoiding illegal moves).
The same performance is reached if we choose $m_k = k + m$ with m much larger than the
typical game lengths. Maximization of only the next credit would yield a very bad chess
playing system. Even if we made our credit c finer, e.g. by evaluating the number
of chessmen, the system would indeed play very bad chess for m = 1.

The AIμ model still depends on μ and $m_k$. The horizon $m_k$ is addressed in a later section. To get our final
universal AI model the idea is to replace μ by the universal probability ξ, defined later.
This is motivated by the fact that ξ → μ in a certain sense for any μ. With ξ instead of
μ our model no longer depends on any parameters, so it is truly universal. It remains to
show that it produces intelligent outputs. But let us continue step by step. In the next
section we develop an alternative but equivalent formulation of the AI model given above.
Whereas the functional form is more suitable for theoretical considerations, especially for
the development of a time-bounded version in section 10, the iterative formulation of the
next section will be more appropriate for the explicit calculations in most of the other
sections.
3 The AIμ Model in Recursive and Iterative Form

Probability distributions: Throughout the paper we deal with sequences/strings and
conditional probability distributions on strings. Some notations are therefore appropriate.

We use Greek letters for probability distributions and underline their arguments to
indicate that they are probability arguments. Let $\rho_n(\underline{x_1 \ldots x_n})$ be the probability that a string
starts with $x_1 \ldots x_n$. We only consider sufficiently long strings, so the $\rho_n$ are normalized to
1. Moreover, we drop the index on ρ if it is clear from its arguments:

$$\sum_{x_n\in X} \rho(\underline{x}_{1:n}) \equiv \sum_{x_n} \rho_n(\underline{x}_{1:n}) = \rho_{n-1}(\underline{x}_{<n}) \equiv \rho(\underline{x}_{<n}), \qquad \rho(\epsilon) \equiv \rho_0(\epsilon) = 1. \qquad (3)$$

We also need conditional probabilities derived from Bayes' rule. We prefer a notation
which preserves the chronological order of the words, in contrast to the standard notation
ρ(·|·) which flips it. We extend the definition of ρ to the conditional case with the following
convention for its arguments: An underlined argument $\underline{x}_k$ is a probability variable and
other non-underlined arguments $x_k$ represent conditions. With this convention, Bayes'
rule has the form $\rho(x_{<n}\underline{x}_n) = \rho(\underline{x}_{1:n})/\rho(\underline{x}_{<n})$. The equation states that the probability
that a string $x_1 \ldots x_{n-1}$ is followed by $x_n$ is equal to the probability of $x_1 \ldots x_n*$ divided by
the probability of $x_1 \ldots x_{n-1}*$. We use x* as a shortcut for 'strings starting with x'.

The introduced notation is also suitable for defining the conditional probability
$\rho(y_1\underline{x}_1 \ldots y_n\underline{x}_n)$ that the environment reacts with $x_1 \ldots x_n$ under the condition that the
output of the system is $y_1 \ldots y_n$. The environment is chronological, i.e. input $x_i$ depends
on $yx_{<i}y_i$ only. In the probabilistic case this means that $\rho(y\underline{x}_{<k}y_k) := \sum_{x_k} \rho(y\underline{x}_{1:k})$ is
independent of $y_k$; hence a trailing $y_k$ in the arguments of ρ can be dropped. Probability
distributions with this property will be called chronological. The y are always conditions,
i.e. never underlined, whereas additional conditioning for the x can be obtained with
Bayes' rule

$$\rho(yx_{<n}\,y\underline{x}_n) = \rho(y\underline{x}_{1:n})/\rho(y\underline{x}_{<n}) \quad\text{and}$$
$$\rho(y\underline{x}_{1:n}) = \rho(y\underline{x}_1)\cdot\rho(yx_1\,y\underline{x}_2)\cdot\ldots\cdot\rho(yx_{<n}\,y\underline{x}_n) \qquad (4)$$

The second equation is the first equation applied n times.
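
A small numerical check can make the notation tangible. The Python sketch below builds a joint probability $\rho(y\underline{x}_{1:n})$ as the product of conditionals, as in the second line of (4), and lets one verify the normalization (3), the chronological property, and Bayes' rule. The particular conditional `cond` is a made-up toy distribution chosen only for illustration.

```python
from itertools import product
from typing import Sequence, Tuple

X = (0, 1)  # input alphabet of the toy example


def cond(history: Sequence[Tuple[int, int]], y: int, x: int) -> float:
    """Hypothetical rho(yx_{<n} y x_n) with x_n as the probability variable:
    x_n equals y_n with probability 0.8, lowered to 0.6 if the previous
    input x_{n-1} was 0."""
    p_copy = 0.6 if history and history[-1][1] == 0 else 0.8
    return p_copy if x == y else 1.0 - p_copy


def joint(ys: Sequence[int], xs: Sequence[int]) -> float:
    """rho(yx_{1:n}) with all x underlined, as the product of conditionals (4).

    Only the first len(xs) outputs are used, so a trailing y is ignored,
    which is exactly the chronological property."""
    prob = 1.0
    for n in range(len(xs)):
        history = list(zip(ys[:n], xs[:n]))
        prob *= cond(history, ys[n], xs[n])
    return prob


ys = (1, 0, 1)
total = sum(joint(ys, xs) for xs in product(X, repeat=len(ys)))
```

Summing `joint` over the last input word reproduces the joint probability of the shorter string, which is relation (3) with the outputs y carried along as conditions.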

Alternative Formulation of the AIμ Model: Let us define the AIμ model p* in
a different way. In the next subsection we will show that the p* model defined here is
identical to the functional definition of p* given in the last section.

Let $\mu(y\underline{x}_{1:k})$ be the true chronological prior probability that the environment reacts with
$x_{1:k}$ if provided with actions $y_{1:k}$ from the system. We assume the cybernetic model
depicted in Section 2 to be valid. Next we define $C^*_{k+1,m}(yx_{1:k})$ to be the μ expected credit
sum in cycles k+1 to m with outputs $y_i$ generated by system p* and past responses $x_i$
from the environment. Adding $c(x_k)$ we get the credit including cycle k. The probability
of $x_k$, given $yx_{<k}y_k$, is given by the conditional probability $\mu(yx_{<k}\,y\underline{x}_k)$. So the expected
credit sum in cycles k to m given $yx_{<k}y_k$ is

$$C^*_{km}(yx_{<k}y_k) := \sum_{x_k} [c(x_k) + C^*_{k+1,m}(yx_{1:k})]\cdot\mu(yx_{<k}\,y\underline{x}_k) \qquad (5)$$
Now we ask how p* chooses $y_k$. It should choose $y_k$ so as to maximize the future
credit. So the expected credit in cycles k to m given $yx_{<k}$ and $y_k$ chosen by p*
is $C^*_{km}(yx_{<k}) := \max_{y_k} C^*_{km}(yx_{<k}y_k)$. Together with the induction start

$$C^*_{m+1,m}(yx_{1:m}) := 0 \qquad (6)$$

$C^*_{km}$ is completely defined. We might summarize one cycle into the formula

$$C^*_{km}(yx_{<k}) = \max_{y_k} \sum_{x_k} [c(x_k) + C^*_{k+1,m}(yx_{1:k})]\cdot\mu(yx_{<k}\,y\underline{x}_k) \qquad (7)$$

If $m_k$ is our horizon function of p* and $\dot y\dot x_{<k}$ is the actual history in cycle k, the output
$\dot y_k$ of the system is explicitly given by

$$\dot y_k = \mathrm{maxarg}_{y_k}\, C^*_{k m_k}(\dot y\dot x_{<k}y_k) =: p^*(\dot y\dot x_{<k}) \qquad (8)$$

Then the environment responds $\dot x_k$ with probability $\mu(\dot y\dot x_{<k}\,\dot y\underline{\dot x}_k)$. Then cycle k+1 starts.
We might unfold the recursion further and give $\dot y_k$ non-recursively as

$$\dot y_k = \mathrm{maxarg}_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} (c(x_k) + \ldots + c(x_{m_k}))\cdot\mu(\dot y\dot x_{<k}\,y\underline{x}_{k:m_k}) \qquad (9)$$

This has a direct interpretation: the probability of inputs $x_{k:m_k}$ in cycle k when the system
outputs $y_{k:m_k}$ and the actual history is $\dot y\dot x_{<k}$ is $\mu(\dot y\dot x_{<k}\,y\underline{x}_{k:m_k})$. The future credit in this
case is $c(x_k) + \ldots + c(x_{m_k})$. The best expected credit is obtained by averaging over the
$x_i$ ($\sum_{x_i}$) and maximizing over the $y_i$. This has to be done in chronological order to
correctly incorporate the dependency of $x_i$ and $y_i$ on the history. This is essentially the
expectimax algorithm/sequence. The AIμ model is optimal in the sense that no
other policy leads to higher expected credit.

These explicit as well as recursive definitions of the AIμ model are more index intensive
as compared to the functional form but are more suitable for explicit calculations.
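
The recursion (5)-(8) translates almost literally into code. The following Python sketch is an illustration under stated assumptions, not an implementation from the paper: finite alphabets X and Y, a fixed final cycle m, a conditional probability supplied as a function `mu(h, y, x)`, and a toy environment `toy_mu` invented here so that a greedy horizon and a longer horizon disagree.

```python
from typing import Callable, Sequence, Tuple

History = Tuple[Tuple[int, int], ...]      # ((y_1, x_1), ..., (y_{k-1}, x_{k-1}))
Mu = Callable[[History, int, int], float]  # mu(yx_{<k} y_k x_k), x_k underlined
Credit = Callable[[int], float]            # c(x_k)


def value(h: History, m: int, Y: Sequence[int], X: Sequence[int],
          mu: Mu, c: Credit) -> float:
    """C*_{km}(yx_{<k}) of (7) with k = len(h) + 1; C*_{m+1,m} = 0 as in (6)."""
    if len(h) + 1 > m:
        return 0.0
    return max(action_value(h, y, m, Y, X, mu, c) for y in Y)


def action_value(h: History, y: int, m: int, Y: Sequence[int],
                 X: Sequence[int], mu: Mu, c: Credit) -> float:
    """C*_{km}(yx_{<k} y_k) of (5): mu-average of credit plus future value."""
    return sum(mu(h, y, x) * (c(x) + value(h + ((y, x),), m, Y, X, mu, c))
               for x in X)


def best_output(h: History, m: int, Y: Sequence[int], X: Sequence[int],
                mu: Mu, c: Credit) -> int:
    """y_k of (8); ties go to the smallest maximizing y."""
    return max(sorted(Y), key=lambda y: action_value(h, y, m, Y, X, mu, c))


def toy_mu(h: History, y: int, x: int) -> float:
    """Hypothetical environment: output 0 earns 1 now, output 1 earns nothing
    now but adds 2 to the next payoff; a positive payoff is lost (x = 0)
    with probability 0.1."""
    base = (1 if y == 0 else 0) + (2 if h and h[-1][0] == 1 else 0)
    if base == 0:
        return 1.0 if x == 0 else 0.0
    if x == base:
        return 0.9
    return 0.1 if x == 0 else 0.0


Y, X = (0, 1), (0, 1, 2, 3)
credit = float                                 # c(x) = x in this toy example
greedy = best_output((), 1, Y, X, toy_mu, credit)      # horizon of one cycle
farsighted = best_output((), 2, Y, X, toy_mu, credit)  # horizon of two cycles
```

In this toy setting the one-cycle horizon picks the immediately rewarded output 0, while the two-cycle horizon picks output 1 and collects the larger delayed credit, which is the point of maximizing the sum of future credits rather than only the next one. The running time grows like $(|Y|\cdot|X|)^{m-k+1}$, so the sketch is only usable for very small horizons.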



Equivalence of Functional and Iterative AI model: The iterative environmental
probability μ is given by the functional form in the following way,

$$\mu(y\underline{x}_{1:k}) = \sum_{q\,:\,q(y_{1:k}) = x_{1:k}} \mu(q) \qquad (10)$$

as is easy to see. We will prove the equivalence of the functional definition (2) and the
iterative definition (8) only for k = 2 and $m_2 = 3$.
The proof of the general case is completely analogous except that the notation becomes quite
messy.
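
Relation (10) can be illustrated with a handful of deterministic environments. In the Python sketch below, the three chronological environments and their weights are hypothetical choices made only for this example; `mu` then simply adds up the prior weight of every q that maps the given outputs to the given inputs.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Env = Callable[[Tuple[int, ...]], Tuple[int, ...]]  # q: y_{1:k} -> x_{1:k}


def copy_env(ys: Tuple[int, ...]) -> Tuple[int, ...]:
    """x_i = y_i."""
    return tuple(ys)


def flip_env(ys: Tuple[int, ...]) -> Tuple[int, ...]:
    """x_i = 1 - y_i."""
    return tuple(1 - y for y in ys)


def delay_env(ys: Tuple[int, ...]) -> Tuple[int, ...]:
    """x_1 = 0 and x_i = y_{i-1} for i > 1."""
    return ((0,) + tuple(ys[:-1])) if ys else ()


# Hypothetical prior mu(q) over the three deterministic environments.
PRIOR: List[Tuple[Env, float]] = [(copy_env, 0.5), (flip_env, 0.3), (delay_env, 0.2)]


def mu(ys: Sequence[int], xs: Sequence[int]) -> float:
    """Right-hand side of (10): total weight of all q with q(y_{1:k}) = x_{1:k}."""
    return sum(w for q, w in PRIOR if q(tuple(ys)) == tuple(xs))


def total_mass(ys: Sequence[int]) -> float:
    """Sum of mu over all input strings of the same length as ys."""
    return sum(mu(ys, xs) for xs in product((0, 1), repeat=len(ys)))
```

Because every q in the list is chronological, the resulting μ is chronological as well: summing `mu` over the last input word gives a value that no longer depends on the last output word.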

Let us first evaluate (1) for fixed $\dot y_1\dot x_1$ and some $p \in \dot P_2$, i.e. $p(\dot x_1) = \dot y_1 y_2$ for some $y_2$.
If the next input to the system is $x_2$, p will respond with $p(\dot x_1 x_2) = \dot y_1 y_2 y_3$ for some $y_3$
depending on $x_2$. We write $y_3(x_2)$ in the following⁴. The numerator of (1) simplifies to

$$\sum_{q\in\dot Q_2} \mu(q)\,C_{23}(p,q) = \sum_{q\,:\,q(\dot y_1)=\dot x_1} \mu(q)\,C_{23}(p,q) = \sum_{x_2 x_3} (c(x_2)+c(x_3)) \sum_{q\,:\,q(\dot y_1 y_2 y_3(x_2)) = \dot x_1 x_2 x_3} \mu(q) =$$
$$= \sum_{x_2 x_3} (c(x_2)+c(x_3))\cdot\mu(\dot y_1\underline{\dot x}_1\, y_2\underline{x}_2\, y_3(x_2)\underline{x}_3)$$

⁴ Dependency on dotted words like $\dot x_1$ is not shown as the dotted words are fixed.



In the first equality we inserted the definition of $\dot Q_2$. In the second equality we split the sum over $q$ by first summing over $q$ with fixed $x_2x_3$. This allows us to pull $C_{23}=c(x_2)+c(x_3)$ out of the inner sum. Then we sum over $x_2x_3$. Further, we have inserted $p$, i.e. replaced $p$ by $y_2$ and $y_3(\cdot)$. In the last equality we used (10). The denominator reduces to

$$\sum_{q\in\dot Q_2}\mu(q) \;=\; \sum_{q:q(\dot y_1)=\dot x_1}\mu(q) \;=\; \mu(\dot y_1\underline{\dot x}_1).$$

For the quotient we get

$$C_{23}(p|\dot y_1\dot x_1) \;=\; \sum_{x_2x_3}\big(c(x_2)+c(x_3)\big)\cdot\mu(\dot y_1\dot x_1y_2\underline x_2y_3(x_2)\underline x_3)$$

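We have seen that the relevant behaviour of $p\in\dot P_2$ in cycles 2 and 3 is completely determined by $y_2$ and the function $y_3(\cdot)$:

$$\max_{p\in\dot P_2}C_{23}(p|\dot y_1\dot x_1) \;=\; \max_{y_2}\max_{y_3(\cdot)}\sum_{x_2x_3}\big(c(x_2)+c(x_3)\big)\cdot\mu(\dot y_1\dot x_1y_2\underline x_2y_3(x_2)\underline x_3) \;=\; \max_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}\big(c(x_2)+c(x_3)\big)\cdot\mu(\dot y_1\dot x_1y_2\underline x_2y_3\underline x_3)$$

In the last equality we have used the fact that the functional maximization over $y_3(\cdot)$ reduces to a simple maximization over the word $y_3$ when interchanging with the sum over its arguments ($\max_{y_3(\cdot)}\sum_{x_2}=\sum_{x_2}\max_{y_3}$). In the functional case $\dot y_2$ is therefore determined by

$$\dot y_2 \;=\; \operatorname*{maxarg}_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}\big(c(x_2)+c(x_3)\big)\cdot\mu(\dot y_1\dot x_1y_2\underline x_2y_3\underline x_3)$$

This is identical to the iterative definition (9) with $k=2$ and $m_2=3$ □.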
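The one step of the proof that is not pure bookkeeping is the interchange of a maximization over functions with a sum over their arguments. A brute-force check on a small table, with hypothetical names and values chosen only for illustration, shows the two sides agreeing: maximizing $\sum_x g(x,f(x))$ over all maps $f:X\to Y$ gives the same number as summing the pointwise maxima.

```python
from itertools import product


def max_over_functions(X, Y, g):
    """max over all maps f: X -> Y of sum_x g(x, f(x)), by enumeration."""
    best = float("-inf")
    for choice in product(Y, repeat=len(X)):   # choice[i] plays f(X[i])
        total = sum(g(x, y) for x, y in zip(X, choice))
        best = max(best, total)
    return best


def sum_of_pointwise_max(X, Y, g):
    """sum over x of max over y of g(x, y)."""
    return sum(max(g(x, y) for y in Y) for x in X)


# Hypothetical table g(x, y); rows are x, columns are y.
TABLE = [[3, 1, 4],
         [1, 5, 9],
         [2, 6, 5]]
X = Y = (0, 1, 2)
g = lambda x, y: TABLE[x][y]

lhs = max_over_functions(X, Y, g)
rhs = sum_of_pointwise_max(X, Y, g)
```

The enumeration visits $|Y|^{|X|}$ functions while the pointwise form needs only $|X|\cdot|Y|$ evaluations, which is the practical content of the interchange.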



Factorizable $\mu$: Up to now we have made no restrictions on the form of the prior probability $\mu$ apart from being a chronological probability distribution. On the other hand, we will see that, in order to prove rigorous credit bounds, the prior probability must satisfy some separability condition to be defined later. Here we introduce a very strong form of separability, in which $\mu$ factorizes into products. We start with a factorization into two factors. Let us assume that $\mu$ is of the form

$$\mu(y\underline x_{1:n}) \;=\; \mu_1(y\underline x_{<l})\cdot\mu_2(y\underline x_{l:n}) \qquad (11)$$

for some fixed $l$ and sufficiently large $n\ge m_k$. For this $\mu$ the output $\dot y_k$ in cycle $k$ of the AI$\mu$ system (9) for $k\ge l$ depends on $\dot y\dot x_{l:k-1}$ and $\mu_2$ only and is independent of $\dot y\dot x_{<l}$ and $\mu_1$. This is easily seen when inserting

$$\mu(\dot y\dot x_{<k}\,y\underline x_{k:m_k}) \;=\; \underbrace{\mu_1(\dot y\dot x_{<l})}_{=1}\cdot\,\mu_2(\dot y\dot x_{l:k-1}\,y\underline x_{k:m_k}) \qquad (12)$$
into (9). For $k<l$ the output $\dot y_k$ depends on $\dot y\dot x_{<k}$ (this is trivial) and $\mu_1$ only (trivial if $m_k<l$) and is independent of $\mu_2$. The non-trivial case, where the horizon $m_k\ge l$ reaches into the region of $\mu_2$, can be proved as follows (we abbreviate $m:=m_k$ in the following). Inserting (11) into the definition of $C^*_{lm}(\dot y\dot x_{<l})$, the factor $\mu_1$ is 1 as in (12). We abbreviate $C^*_{lm}:=C^*_{lm}(\dot y\dot x_{<l})$ as it is independent of its arguments. One can decompose

$$C^*_{km}(\dot y\dot x_{<k}) \;=\; C^*_{k,l-1}(\dot y\dot x_{<k}) + C^*_{lm} \qquad (13)$$

For $k=l$ this is true because the first term on the r.h.s. is zero. For $k<l$ we prove the decomposition by induction from $k+1$ to $k$.

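$$
\begin{aligned}
C^*_{km}(\dot y\dot x_{<k}) &= \max_{y_k}\sum_{x_k}\Big[c(x_k)+C^*_{k+1,l-1}(\dot y\dot x_{<k}yx_k)+C^*_{lm}\Big]\cdot\mu_1(\dot y\dot x_{<k}y\underline x_k)\\
&= \max_{y_k}\Big[\sum_{x_k}\big(c(x_k)+C^*_{k+1,l-1}(\dot y\dot x_{<k}yx_k)\big)\cdot\mu_1(\dot y\dot x_{<k}y\underline x_k)\Big]+C^*_{lm}\\
&= C^*_{k,l-1}(\dot y\dot x_{<k})+C^*_{lm}
\end{aligned}
$$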

Inserting (13), valid for $k+1$ by induction hypothesis, into the recursive definition of $C^*_{km}$ gives the first equality. In the second equality we have performed the $x_k$ sum for the $C^*_{lm}\cdot\mu_1$ term, which is now independent of $y_k$. It can therefore be pulled out of $\max_{y_k}$. In the last equality we used the recursive definition again. This completes the induction step and proves (13) for $k<l$. $\dot y_k$ can now be represented as

$$\dot y_k \;=\; \operatorname*{maxarg}_{y_k} C^*_{km}(\dot y\dot x_{<k}y_k) \;=\; \operatorname*{maxarg}_{y_k} C^*_{k,l-1}(\dot y\dot x_{<k}y_k) \qquad (14)$$

where (8) and (13) and the fact that an additive constant $C^*_{lm}$ does not change $\operatorname{maxarg}_{y_k}$ have been used. $C^*_{k,l-1}(\dot y\dot x_{<k}y_k)$, and hence $\dot y_k$, is independent of $\mu_2$ for $k<l$. Note that $\dot y_k$ is also independent of the choice of $m$, as long as $m\ge l$.

In the general case the cycles are grouped into independent episodes $r=1,2,3,\dots$, where each episode $r$ consists of the cycles $k=n_r+1,\dots,n_{r+1}$ for some $0=n_0<n_1<\dots<n_s=n$:

$$\mu(y\underline x_{1:n}) \;=\; \prod_{r=0}^{s-1}\mu_r(y\underline x_{n_r+1:n_{r+1}}) \qquad (15)$$

In the simplest case, when all episodes have the same length $l$, we have $n_r=r\cdot l$. $\dot y_k$ depends on $\mu_r$ and on the $x$ and $y$ of episode $r$ only, with $r$ such that $n_r<k\le n_{r+1}$:

$$\dot y_k \;=\; \operatorname*{maxarg}_{y_k}\sum_{x_k}\cdots\max_{y_t}\sum_{x_t}\big(c(x_k)+\dots+c(x_t)\big)\cdot\mu_r(\dot y\dot x_{n_r+1:k-1}\,y\underline x_{k:t}) \qquad (16)$$

with $t:=\min\{m_k,n_{r+1}\}$. The different episodes are completely independent in the following sense. The inputs $x_k$ of different episodes are statistically independent and depend only on the $y_k$ of the same episode. The outputs $y_k$ depend on the $x$ and $y$ of the corresponding episode $r$ only, and are independent of the actual I/O of the other episodes.

If all episodes have a length of at most $l$, i.e. $n_{r+1}-n_r\le l$, and if we choose the horizon $h_k$ to be at least $l$, then $m_k\ge k+l-1\ge n_r+l\ge n_{r+1}$ and hence $t=n_{r+1}$ independent of $m_k$. This means that for factorizable $\mu$ there is no problem in taking the limit $m_k\to\infty$. Maybe this limit can also be performed in the more general case of a separable $\mu$. The (problem of the) choice of $m_k$ will be discussed in more detail later.

Although factorizable $\mu$ are too restrictive to cover all AI problems, they often occur in practice in the form of repeated problem solving and hence are worth studying. For example, if the system has to play games like chess repeatedly, or has to minimize different functions, the different games/functions might be completely independent, i.e. the environmental probability factorizes, where each factor corresponds to a game/function minimization. For details, see the appropriate sections on strategic games and function minimization.

Further, for factorizable $\mu$ it is probably easier to derive suitable credit bounds for the universal AI$\xi$ model defined in the next section than for the general separable case, which will be introduced later. One goal of this paragraph was to show that the notion of a factorizable $\mu$ could be the first step toward a definition and analysis of the general case of separable $\mu$.

Constants and Limits: We have in mind a universal system with complex interactions that is at least as intelligent and complex as a human being. One might think of a system whose input $y_k$ comes from a digital video camera and whose output $x_k$ is some image to a monitor^6; only for the valuation we might restrict to the most primitive binary one, i.e. $c_k\in\mathbb{B}$. So we think of the following constant sizes:

$$1 \;\ll\; \langle l(y_kx_k)\rangle \;\ll\; k \;\le\; T \;\ll\; |Y\times X|$$
1 < 2^^ < 2^^ < < 2^^^^^ 

The first two limits say that the actual number $k$ of inputs/outputs should be reasonably large compared to the typical size $\langle l\rangle$ of the input/output words, which itself should be rather sizeable. The last limit expresses the fact that the total lifetime $T$ (number of I/O cycles) of the system is far too small to allow every possible input to occur, or to try every possible output, or to make use of identically repeated inputs or outputs. We do not expect any useful outputs for $k<\langle l\rangle$. More interesting than the lengths of the inputs is the complexity $K(x_1\dots x_k)$ of all inputs until now, to be defined later. The environment is usually not "perfect". The system could either interact with a non-perfect human or tackle a non-deterministic world (due to quantum mechanics or chaos)^7. In either case, the sequence contains some noise, leading to $K(x_1\dots x_k)\sim\langle l\rangle\cdot k$. The complexity of the probability distribution of the input sequence is something different. We assume that this noisy world operates according to some simple computable, though not finite, rules: $K(\mu_k)\ll\langle l\rangle\cdot k$, i.e. the rules of the world can be highly compressed. On the other hand, new aspects of the environment may appear for $k\to\infty$, causing a non-bounded $K(\mu_k)$.

^6 Humans can only simulate a screen as output device by drawing pictures.

^7 Whether there exist stochastic processes at all is a difficult question. At least the quantum indeterminacy comes very close to it.






In the following we never use these limits, except when explicitly stated. In some simpler models and examples the size of the constants will even violate these limits (e.g. $l(x_k)=l(y_k)=1$), but it is the limits above that the reader should bear in mind. We are only interested in theorems which do not degenerate under the above limits.



Sequential decision theory: In the following we clarify the connection of the recursive AI$\mu$ model to sequential decision theory and discuss similarities and differences. With probability $M^a_{ij}$, the system under consideration should reach (environmental) state $j\in S$ when taking action $a\in A$ depending on the current state $i\in S$. If the system receives reward $R(i)$, the optimal policy $p^*$, maximizing expected utility (defined as sum of future rewards), and the utility $U(i)$ of policy $p^*$ are

$$p^*(i) \;=\; \operatorname*{maxarg}_a\sum_j M^a_{ij}\,U(j)\,,\qquad U(i) \;=\; R(i)+\max_a\sum_j M^a_{ij}\,U(j) \qquad (17)$$



See I^TI for details and further references. Let us identify 

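$$
\begin{aligned}
&S=(Y\times X)^*,\quad A=Y,\quad a=y_k,\quad M^a_{ij}=\mu(y\!x_{<k}\,y\underline x_k),\\
&i=y\!x_{<k},\quad R(i)=c(x_{k-1}),\quad U(i)=C^*_{k-1,m}(y\!x_{<k})=c(x_{k-1})+C^*_{km}(y\!x_{<k}),\\
&j=y\!x_{1:k},\quad R(j)=c(x_k),\quad U(j)=C^*_{km}(y\!x_{1:k})=c(x_k)+C^*_{k+1,m}(y\!x_{1:k}),
\end{aligned}
$$

where we further set $M^a_{ij}=0$ if $i$ is not a starting substring of $j$ or if $a\ne y_k$. This ensures that the sum over $j$ in (17) reduces to a sum over $x_k$. If we set $m_k=m$ and use $C^*_{km}(y\!x_{<k}y_k)=\sum_{x_k}C^*_{km}(y\!x_{1:k})\,M^{y_k}_{ij}$, it is easy to see that (17) coincides with the recursive definition of $C^*_{km}$ and the output rule (8).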

Note that despite this formal equivalence, we were forced to use the complete history $y\!x_{<k}$ as environmental state $i$. The AI$\mu$ model assumes neither stationarity, nor the Markov property, nor complete accessibility of the environment, as any such assumption would restrict the applicability of AI$\mu$. The consequence is that every state occurs at most once in the lifetime of the system. Every moment in the universe is unique! Even if the state space could be identified with the input space $X$, inputs would usually not occur twice by the assumption $k\ll|X|$ made in the last subsection. Further, there is no (obvious) universal similarity relation on $(X\times Y)^*$ allowing an effective reduction of the size of the state space. Although many algorithms (e.g. value and policy iteration) have problems in solving (17) for huge or infinite state spaces in practice, there is no problem in principle in determining $p^*$ and $U$, as long as $\mu$ is known and $|Y|$ and $m$ are finite.
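For a small, fully known state space the utility recursion can be solved directly by backward induction over the number of remaining steps. The following is a minimal sketch under that assumption; the function name, the nested-list layout `M[a][i][j]` and the two-state example are hypothetical and serve only to illustrate the recursion $U(i)=R(i)+\max_a\sum_j M^a_{ij}U(j)$ with a finite horizon.

```python
def finite_horizon_values(M, R, steps):
    """Backward induction for U(i) = R(i) + max_a sum_j M[a][i][j] * U(j).

    M     -- M[a][i][j]: probability of reaching state j from state i
             under action a (assumed known)
    R     -- R[i]: reward received in state i
    steps -- number of further transitions taken into account
    Returns the utilities and, for steps >= 1, a maximizing first action
    per state (for steps == 0 the returned actions are placeholders).
    """
    n, n_actions = len(R), len(M)
    U = list(R)                       # no further steps: immediate reward only
    policy = [0] * n
    for _ in range(steps):
        new_U, policy = [], []
        for i in range(n):
            q = [sum(M[a][i][j] * U[j] for j in range(n))
                 for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            policy.append(best)
            new_U.append(R[i] + q[best])
        U = new_U
    return U, policy


# Hypothetical example: action 0 stays, action 1 switches state;
# only state 1 is rewarded.
STAY = [[1, 0], [0, 1]]
SWITCH = [[0, 1], [1, 0]]
U, policy = finite_horizon_values([STAY, SWITCH], [0, 1], steps=2)
```

This works only because the two states recur; with the complete history as state, as in the identification above, every state is visited at most once and the table would have to cover all of $(Y\times X)^*$ up to the horizon.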



Things change dramatically if $\mu$ is unknown. Reinforcement learning algorithms [15] are commonly used in this case to learn the unknown $\mu$. They succeed if the state space is either small or has effectively been made small by so-called generalization techniques. In any case, the solutions are either ad hoc, or work in restricted domains only, or have serious problems with state space exploration versus exploitation, or have non-optimal learning rates. There is no universal and optimal solution to this problem so far. In the next section we present a new model and argue that it formally solves all these problems in an optimal way. It will not be concerned with learning $\mu$ directly. All we do is replace the true prior probability $\mu$ by a universal probability $\xi$, which is shown to converge to $\mu$ in a sense.
4 The Universal AI$\xi$ Model

Induction and Algorithmic Information theory: One very important and highly non-trivial aspect of intelligence is inductive inference. Before formulating the AI$\xi$ model, a short introduction to the history of induction is given, culminating in the sequence prediction theory of Solomonoff. We emphasize only those aspects which will be of importance for the development of our universal AI$\xi$ model.

Simply speaking, induction is the process of predicting the future from the past or, more precisely, it is the process of finding rules in (past) data and using these rules to guess future data. On the one hand, induction seems to happen in everyday life by finding regularities in past observations and using them to predict the future. On the other hand, this procedure seems to add knowledge about the future from past observations. But how can we know something about the future? This dilemma and the induction principle in general have a long philosophical history

• Hume's negation of induction (1711-1776) [12],

• Epicurus' principle of multiple explanations (342?-270? BC),

• Occam's razor (simplicity) principle (1290?-1349?),

• Bayes' rule for conditional probabilities

and a short but important mathematical history: a clever unification of all these aspects into one formal theory of inductive inference has been done by Solomonoff based on Kolmogorov's [17] definition of complexity. For an excellent introduction to Kolmogorov complexity and Solomonoff induction one should consult the book of Li and Vitányi [24]. In the rest of this subsection we state all results which are needed or generalized later.

Let us choose some universal prefix Turing machine $U$ with unidirectional binary input and output tapes and a bidirectional working tape. We can then define the (prefix) Kolmogorov complexity [11], [19] as the length of the shortest prefix program $p$ for which $U$ outputs $x=x_{1:n}$ with $x_i\in\mathbb{B}$:

$$K(x) \;:=\; \min_p\{l(p) : U(p)=x\}$$

The universal semimeasure $\xi(x)$ is defined as the probability that the output of the universal Turing machine $U$ starts with $x$ when provided with fair coin flips on the input tape |3^. It is easy to see that this is equivalent to the formal definition

$$\xi(x) \;:=\; \sum_{p\,:\,U(p)=x*} 2^{-l(p)} \qquad (18)$$

where the sum is over minimal programs $p$ for which $U$ outputs a string starting with $x$. $U$ might be non-terminating. As the shortest programs dominate the sum, $\xi$ is closely related to $K(x)$ ($\xi(x)=2^{-K(x)+O(K(l(x)))}$). $\xi$ has the important universality property p5| that it majorizes every computable probability distribution $\rho$ up to a multiplicative factor depending only on $\rho$ but not on $x$:

$$\xi(x) \;\overset{\times}{\ge}\; 2^{-K(\rho)}\cdot\rho(x). \qquad (19)$$
A '×' above an (in)equality denotes (in)equality within a universal multiplicative constant, a '+' above an (in)equality denotes (in)equality within a universal additive constant, both depending only on the choice of the universal reference machine $U$. $\xi$ itself is not a probability distribution^8. We have $\xi(x0)+\xi(x1)<\xi(x)$ because there are programs $p$ which output just $x$, followed by neither 0 nor 1. They just stop after printing $x$ or continue forever without any further output. We will call a function $\rho\ge0$ with the properties $\rho(\epsilon)\le1$ and $\sum_{x_n}\rho(\underline x_{1:n})\le\rho(\underline x_{<n})$ a semimeasure. $\xi$ is a semimeasure and (19) actually holds for all enumerable semimeasures $\rho$.
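The two defining conditions of a semimeasure are easy to check mechanically for a function on short binary strings. The sketch below does so by enumeration; `is_semimeasure` and the two example functions are hypothetical names introduced only for illustration (the universal $\xi$ itself is not computable and cannot be tested this way).

```python
from itertools import product


def is_semimeasure(rho, max_len, tol=1e-12):
    """Check rho >= 0, rho(empty) <= 1 and rho(x0) + rho(x1) <= rho(x)
    for all binary strings x shorter than max_len."""
    if rho(()) > 1.0 + tol:
        return False
    for n in range(max_len):
        for prefix in product((0, 1), repeat=n):
            if rho(prefix) < 0.0:
                return False
            children = rho(prefix + (0,)) + rho(prefix + (1,))
            if children > rho(prefix) + tol:
                return False
    return True


def leaky(bits):
    """Loses 20% of the probability mass at every step, mimicking
    programs that stop or loop without further output."""
    return 0.4 ** len(bits)


def inflating(bits):
    """Violates the condition: the two extensions carry more mass."""
    return 0.6 ** len(bits)
```

A proper probability distribution is the special case where both inequalities hold with equality.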

(Binary) sequence prediction algorithms try to predict the continuation $x_n$ of a given sequence $x_1\dots x_{n-1}$. In the following we will assume that the sequences are drawn according to a probability distribution and that the true prior probability of $x_{1:n}$ is $\mu(\underline x_{1:n})$. The probability of $x_n$ given $x_{<n}$ hence is $\mu(x_{<n}\underline x_n)$. The best possible system predicts the $x_n$ with higher probability. Usually $\mu$ is unknown and the system can only have some belief $\rho$ about the true prior probability $\mu$. Let SP$\rho$ be a probabilistic sequence predictor, predicting $x_n$ with probability $\rho(x_{<n}\underline x_n)$. Further we define a deterministic sequence predictor SP$\Theta_\rho$ predicting the $x_n$ with higher $\rho$ probability: $\Theta_\rho(x_{<n}\underline x_n):=1$ if $\rho(x_{<n}\underline x_n)>\frac12$ and $\Theta_\rho(x_{<n}\underline x_n):=0$ otherwise. If $\rho$ is only a semimeasure the SP$\rho$ and SP$\Theta_\rho$ systems might refuse any output in some cycles $n$. The SP$\Theta_\mu$ is the best prediction scheme when $\mu$ is known.

If $\rho(x_{<n}\underline x_n)$ converges quickly to $\mu(x_{<n}\underline x_n)$ the number of additional prediction errors introduced by using $\Theta_\rho$ instead of $\Theta_\mu$ for prediction should be small in some sense. Now the universal probability $\xi$ comes into play, as it has been proved by Solomonoff PB| that the $\mu$ expected Euclidean distance between $\xi$ and $\mu$ is finite

$$\sum_{k=1}^{\infty}\sum_{x_{1:k}}\mu(\underline x_{1:k})\big(\xi(x_{<k}\underline x_k)-\mu(x_{<k}\underline x_k)\big)^2 \;\overset{+}{<}\; \tfrac12\ln2\cdot K(\mu) \qquad (20)$$

The '+' atop '<' means up to additive terms of order 1. So indeed the difference does tend to zero, i.e. $\xi(x_{<n}\underline x_n)\to\mu(x_{<n}\underline x_n)$ for $n\to\infty$ with $\mu$ probability 1 for any computable probability distribution $\mu$. The reason for the astonishing property of a single (universal) function to converge to any computable probability distribution lies in the fact that the sets of $\mu$ random sequences differ for different $\mu$. The universality property (19) is the central ingredient for proving (20).
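The mechanism behind this bound can be observed on a computable stand-in. The sketch below replaces the universal $\xi$ by a Bayes mixture over a small, hand-picked class of Bernoulli sources with weights $w$, which dominates each member by its weight just as $\xi$ dominates $\rho$ by $2^{-K(\rho)}$. For such a mixture the analogous bound is $\frac12\ln(1/w_\mu)$. The class, the weights and all function names are assumptions made for illustration; this is not Solomonoff's construction itself.

```python
import math
from itertools import product


def mixture_prob(bits, thetas, weights):
    """Mixture probability of a binary string under Bernoulli(theta) sources."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return sum(w * t ** ones * (1.0 - t) ** zeros
               for t, w in zip(thetas, weights))


def expected_squared_distance(n, true_theta, thetas, weights):
    """Sum over k <= n of the expected squared difference between the
    mixture's and the true probability that the next bit is 1."""
    total = 0.0
    for k in range(1, n + 1):
        for past in product((0, 1), repeat=k - 1):
            ones = sum(past)
            mu_past = true_theta ** ones * (1.0 - true_theta) ** (len(past) - ones)
            xi_next_one = (mixture_prob(past + (1,), thetas, weights)
                           / mixture_prob(past, thetas, weights))
            total += mu_past * (xi_next_one - true_theta) ** 2
    return total


THETAS = (0.2, 0.5, 0.8)            # hypothetical model class
WEIGHTS = (1 / 3, 1 / 3, 1 / 3)     # hypothetical prior weights
distance = expected_squared_distance(10, 0.8, THETAS, WEIGHTS)
bound = 0.5 * math.log(1 / WEIGHTS[2])
```

The enumeration over all pasts is exponential in `n`; it is used here only because it computes the expectation exactly rather than by sampling.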

Let us define the total number of expected erroneous predictions the SP$\rho$ system makes for the first $n$ bits

$$E_{n\rho} \;:=\; \sum_{k=1}^{n}\sum_{x_{1:k}}\mu(\underline x_{1:k})\big(1-\rho(x_{<k}\underline x_k)\big) \qquad (21)$$

The SP$\Theta_\mu$ system is best in the sense that $E_{n\Theta_\mu}\le E_{n\rho}$ for any $\rho$. It has been shown that SP$\Theta_\xi$ is not much worse

$$E_{n\Theta_\xi}-E_{n\rho} \;\le\; H+\sqrt{4E_{n\rho}H+H^2} \;=\; O\big(\sqrt{E_{n\rho}}\big)\,,\qquad H \;\overset{+}{<}\; \ln2\cdot K(\mu) \qquad (22)$$

with the tightest bound for $\rho=\Theta_\mu$. For finite $E_{\infty\Theta_\mu}$, $E_{\infty\Theta_\xi}$ is finite too. For infinite $E_{\infty\Theta_\mu}$, $E_{n\Theta_\xi}/E_{n\Theta_\mu}\to1$ for $n\to\infty$ with rapid convergence. One can hardly imagine any better prediction algorithm without extra knowledge about the environment. In fl^, (20) and (22) have been generalized from binary to arbitrary alphabet. Apart from computational aspects, which are of course very important, the problem of sequence prediction could be viewed as essentially solved.

^8 It is possible to normalize $\xi$ to a probability distribution, as has been done in M% hsi, by giving up the enumerability of $\xi$. Error bounds (20) and (22) hold for both definitions.

Definition of the AI$\xi$ Model: We have developed enough formalism to suggest our universal AI$\xi$ model^9. All we have to do is to suitably generalize the universal semimeasure $\xi$ from the last subsection and replace the true but unknown prior probability $\mu^{AI}$ in the AI$\mu$ model by this generalized $\xi^{AI}$. In what sense this AI$\xi$ model is universal will be discussed later.

In the functional formulation we define the universal probability $\xi^{AI}$ of an environment $q$ just as $2^{-l(q)}$

$$\xi(q) \;:=\; 2^{-l(q)}$$

The definition could not be easier^10. Collecting the formulas of the functional AI$\mu$ model and replacing $\mu(q)$ by $\xi(q)$ we get the definition of the AI$\xi$ system in functional form. Given the history $\dot y\dot x_{<k}$ the functional AI$\xi$ system outputs

$$\dot y_k \;:=\; \operatorname*{maxarg}_{y_k}\max_{p:p(\dot x_{<k})=\dot y_{<k}y_k}\;\sum_{q:q(\dot y_{<k})=\dot x_{<k}} 2^{-l(q)}\cdot C_{km_k}(p,q) \qquad (23)$$

in cycle $k$, where $C_{km_k}(p,q)$ is the total credit of cycles $k$ to $m_k$ when system $p$ interacts with environment $q$. We have dropped the denominator $\sum_q\mu(q)$ from the expected credit as it is independent of $p\in\dot P_k$, and a constant multiplicative factor does not change maxarg.

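For the iterative formulation the universal probability $\xi$ can be obtained by inserting the functional $\xi(q)$ into (10)

$$\xi(y\underline x_{1:k}) \;=\; \sum_{q\,:\,q(y_{1:k})=x_{1:k}} 2^{-l(q)} \qquad (24)$$

Replacing $\mu$ by $\xi$ in (9), the iterative AI$\xi$ system outputs

$$\dot y_k \;=\; \operatorname*{maxarg}_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\cdots\max_{y_{m_k}}\sum_{x_{m_k}}\big(c(x_k)+\dots+c(x_{m_k})\big)\cdot\xi(\dot y\dot x_{<k}\,y\underline x_{k:m_k}) \qquad (25)$$

in cycle $k$ given the history $\dot y\dot x_{<k}$.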
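The sum over all programs is not computable, but the way a mixture is conditioned on the history can be illustrated with a finite class. The sketch below is a stand-in under strong assumptions: a hand-picked list of chronological environments, each given by its conditional input probability, with explicit prior weights playing the role of $2^{-l(q)}$. The names `mixture_joint`, `mixture_conditional`, `copy_env` and `flip_env` are hypothetical.

```python
def mixture_joint(history, environments, weights):
    """Weighted sum over environments of the probability of the inputs in
    `history` (a tuple of (y, x) pairs), given the outputs in it."""
    total = 0.0
    for env, w in zip(environments, weights):
        prob, past = w, ()
        for y, x in history:
            prob *= env(past, y, x)       # chronological: depends on the past only
            past += ((y, x),)
        total += prob
    return total


def mixture_conditional(history, y, x, environments, weights):
    """Probability of input x after output y, given the history.
    Assumes the history has non-zero mixture probability."""
    denom = mixture_joint(history, environments, weights)
    return mixture_joint(history + ((y, x),), environments, weights) / denom


def copy_env(past, y, x):
    """Hypothetical environment: usually echoes the system's output."""
    return 0.9 if x == y else 0.1


def flip_env(past, y, x):
    """Hypothetical environment: usually inverts the system's output."""
    return 0.9 if x != y else 0.1


ENVS, W = (copy_env, flip_env), (0.5, 0.5)
prior = mixture_conditional((), 1, 1, ENVS, W)
posterior = mixture_conditional(((1, 1), (0, 0), (1, 1)), 1, 1, ENVS, W)
```

After three cycles in which the input echoed the output, the conditional probability has moved from the uninformed value toward that of the echoing environment; such a conditional could be plugged into an expectimax search in place of the true distribution.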

One subtlety has been passed over. Like in the SP case, $\xi$ is not a probability distribution but satisfies only the weaker inequalities

$$\sum_{x_n}\xi(y\underline x_{1:n}) \;\le\; \xi(y\underline x_{<n})\,,\qquad \xi(\epsilon)\;\le\;1 \qquad (26)$$

^9 Speak 'aixi' and write AIXI without Greek letters.

^10 It is not necessary to use $2^{-K(q)}$ or something similar, as some readers may expect at this point. The reason is that for every program $q$ there exists a functionally equivalent program $q'$ with $K(q')\overset{+}{=}l(q')$.

^11 Here and later we identify objects with their coding relative to some fixed Turing machine $U$. For example, if $q$ is a function, $K(q):=K(\lceil q\rceil)$ with $\lceil q\rceil$ being a binary coding of $q$ such that $U(\lceil q\rceil,y)=q(y)$. On the other hand, if $q$ already is a binary string we define $q(y):=U(q,y)$.
Note that the sum on the l.h.s. is not independent of $y_n$, unlike for chronological probability distributions. Nevertheless, it is bounded by something (the r.h.s.) which is independent of $y_n$. The reason is that the sum in (24) runs over (partial recursive) chronological functions only, and the functions $q$ which satisfy $q(y_{1:n})=x_{1:n}$ are a subset of the functions satisfying $q(y_{<n})=x_{<n}$. Therefore we will in general call functions satisfying (26) chronological semimeasures. The important point is that the conditional probabilities are $\le1$, as for true probability distributions.

The equivalence of the functional and iterative AI model proven in section 3 is true for every chronological semimeasure $\rho$, esp. for $\xi$, hence we can talk about the AI$\xi$ model in this respect. It (slightly) depends on the choice of the universal Turing machine: $l(q)$ is defined only up to an additive constant. It also depends on the choice of $X=C\times X'$ and $Y$, but we do not expect any bias when the spaces are chosen sufficiently simple, e.g.
all strings of length 2^^. Choosing IN as word space would be optimal, but whether the 
maxima (suprema) exist in this case has to be shown beforehand. The only non-trivial dependence is on the horizon function $m_k$, which will be discussed later. So apart from $m_k$ and unimportant details the AI$\xi$ system is uniquely defined by (23) or (25). It doesn't depend on assumptions about the environment apart from its being generated by some computable (but unknown!) probability distribution.



Universality of $\xi^{AI}$: In which sense the AI$\xi$ model is optimal will be clarified later. In this and the next two subsections we show that $\xi^{AI}$ defined in (24) is universal and converges to $\mu^{AI}$ analogously to the SP case (19) and (20). The proofs are generalizations from the SP case. The $y$ are pure spectators and cause no difficulties in the generalization. The replacement of the binary alphabet $\mathbb{B}$ used in SP by the (possibly infinite) alphabet $X$ is possible, but needs to be done with care. In (18) $U(p)=x*$ produces strings starting with $x$, whereas in (24) we can demand $q$ to output exactly $n$ words $x_{1:n}$, as $q$ knows $n$ from the number of input words $y_1\dots y_n$. For proofs of (19) and (20) see the cited literature.



There is an alternative definition of $\xi$ which coincides with (24) within a multiplicative constant of $O(1)$,

$$\xi(y\underline x_{1:n}) \;=\; \sum_\rho 2^{-K(\rho)}\rho(y\underline x_{1:n}) \qquad (27)$$

where the sum runs over all enumerable chronological semimeasures. The $2^{-K(\rho)}$ weighted sum over probabilistic environments $\rho$ coincides with the sum over $2^{-l(q)}$ weighted deterministic environments $q$, as will be proved below. In the next subsection we show that an enumeration of all enumerable functions can be converted into an enumeration of enumerable chronological semimeasures $\rho$. $K(\rho)$ is co-enumerable, therefore $\xi$ defined in (27) is itself enumerable. The representation (24) is also enumerable. As $\sum_\rho2^{-K(\rho)}\le1$ and the $\rho$'s satisfy (26), $\xi$ is a chronological semimeasure as well. If we pick one $\rho$ in (27) we get the universality property "for free"

$$\xi(y\underline x_{1:n}) \;\ge\; 2^{-K(\rho)}\rho(y\underline x_{1:n}) \qquad (28)$$

$\xi$ is a universal element in the sense of (28) in the set of all enumerable chronological semimeasures.
To prove universality of $\xi$ in the form (24) we have to show that for every enumerable chronological semimeasure $\rho$ there exists a Turing machine $T$ with

$$\rho(y\underline x_{1:n}) \;=\; \sum_{q\,:\,T(qy_{1:n})=x_{1:n}} 2^{-l(q)} \quad\text{and}\quad l(T)\overset{+}{=}K(\rho). \qquad (29)$$

This will not be done here. Given $T$ the universality of $\xi$ follows from

$$\xi(y\underline x_{1:n}) \;=\; \sum_{q:U(qy_{1:n})=x_{1:n}} 2^{-l(q)} \;\ge\; \sum_{q':U(Tq'y_{1:n})=x_{1:n}} 2^{-l(Tq')} \;=\; 2^{-l(T)}\sum_{q':T(q'y_{1:n})=x_{1:n}} 2^{-l(q')} \;\overset{\times}{=}\; 2^{-K(\rho)}\rho(y\underline x_{1:n})$$

The first equality and (24) are identical by definition. In the inequality we have restricted the sum over all $q$ to $q$ of the form $q=Tq'$. The third relation is true as running $U$ on $Tz$ is a simulation of $T$ on $z$. The last equality follows from (29). All enumerable, universal, chronological semimeasures coincide up to a multiplicative constant, as they mutually dominate each other. Hence, definitions (24) and (27) are, indeed, equivalent.



Converting general functions into chronological semi-measures: To complete the proof of the universality (28) of $\xi$ we need to convert enumerable functions $\psi:\mathbb{B}^*\to\mathbb{R}^+$ into enumerable chronological semi-measures $\rho:(Y\times X)^*\to\mathbb{R}^+$ with certain additional properties. Every enumerable function like $\psi$ and $\rho$ can be approximated from below by definition^12 by primitive recursive functions $\varphi:\mathbb{B}^*\times\mathbb{N}\to\mathbb{Q}^+$ and $\phi:(Y\times X)^*\times\mathbb{N}\to\mathbb{Q}^+$ with $\psi(s)=\sup_t\varphi(s,t)$ and $\rho(s)=\sup_t\phi(s,t)$ and recursion parameter $t$. For arguments of the form $s=y\!x_{1:n}$ we recursively (in $n$) construct $\phi$ from $\varphi$ as follows:

$$\varphi'(y\!x_{1:n},t) \;:=\; \begin{cases}\varphi(y\!x_{1:n},t) & \text{for } x_n<t\\ 0 & \text{for } x_n\ge t\end{cases} \qquad (30)$$

$$\phi(\epsilon,t) \;:=\; \max_{0\le i\le t}\{\varphi'(\epsilon,i) : \varphi'(\epsilon,i)\le1\} \qquad (31)$$

$$\phi(y\!x_{1:n},t) \;:=\; \max_{0\le i\le t}\Big\{\varphi'(y\!x_{1:n},i) : \sum_{x_n}\varphi'(y\!x_{1:n},i)\le\phi(y\!x_{<n},t)\Big\} \qquad (32)$$

With $x_n<t$ we mean that the natural number associated with string $x_n$ is smaller than $t$. According to (30), with $\varphi$ also $\varphi'$ as well as $\sum_{x_n}\varphi'$ are primitive recursive functions. Further, if we allow $t=0$ we have $\varphi'(s,0)=0$. This ensures that $\phi$ is a total function.

In the following we prove by induction over $n$ that $\phi$ is a primitive recursive chronological semimeasure monotone increasing in $t$. All necessary properties hold for $n=0$ ($y\!x_{1:0}=\epsilon$) according to (31). For general $n$ assume that the induction hypothesis is true for $\phi(y\!x_{<n},t)$. We can see from (32) that $\phi(y\!x_{1:n},t)$ is monotone increasing in $t$. $\phi$ is total as $\varphi'(y\!x_{1:n},i=0)=0$ satisfies the inequality. By assumption $\phi(y\!x_{<n},t)$ is primitive recursive, hence with $\sum_{x_n}\varphi'$ also the order relation $\sum\varphi'\le\phi$ is primitive recursive. This ensures that the non-empty finite set $\{\varphi' : \sum\varphi'\le\phi\}_i$ and its maximum $\phi(y\!x_{1:n},t)$ are primitive recursive. Further, $\phi(y\!x_{1:n},t)=\varphi'(y\!x_{1:n},i)$ for some $i$ with $i\le t$ independent of $x_n$. Thus, $\sum_{x_n}\phi(y\!x_{1:n},t)=\sum_{x_n}\varphi'(y\!x_{1:n},i)\le\phi(y\!x_{<n},t)$, which is the condition for being a chronological semimeasure. Inductively we have proved that $\phi$ is indeed a primitive recursive chronological semimeasure monotone increasing in $t$.

^12 Defining enumerability as the supremum of total primitive recursive functions is more suitable for our purpose than the equivalent definition as a limit of monotone increasing partial recursive functions. In terms of Turing machines, the recursion parameter is the time after which a computation is terminated.

In the following we show that every (total)^13 enumerable chronological semimeasure $\rho$ can be enumerated by some $\phi$. By definition of enumerability there exist primitive recursive functions $\tilde\varphi$ with $\rho(s)=\sup_t\tilde\varphi(s,t)$. The function $\varphi(s,t):=(1-1/t)\cdot\max_{i<t}\tilde\varphi(s,i)$ also enumerates $\rho$ but has the additional advantage of being strictly monotone increasing in $t$.

$\varphi'(y\!x_{1:n},\infty)=\varphi(y\!x_{1:n},\infty)=\rho(y\!x_{1:n})$ by definition. $\phi(\epsilon,t)=\varphi'(\epsilon,t)$ by (31) and the fact that $\varphi'(\epsilon,i-1)<\varphi'(\epsilon,i)\le\varphi(\epsilon,i)<\rho(\epsilon)\le1$, hence $\phi(\epsilon,\infty)=\rho(\epsilon)$. $\phi(y\!x_{1:n},t)\le\varphi'(y\!x_{1:n},t)$ by (32), hence $\phi(y\!x_{1:n},\infty)\le\rho(y\!x_{1:n})$. We prove the opposite direction $\phi(y\!x_{1:n},\infty)\ge\rho(y\!x_{1:n})$ by induction over $n$. We have

$$\sum_{x_n}\varphi'(y\!x_{1:n},i) \;\le\; \sum_{x_n}\varphi(y\!x_{1:n},i) \;<\; \sum_{x_n}\varphi(y\!x_{1:n},\infty) \;=\; \sum_{x_n}\rho(y\!x_{1:n}) \;\le\; \rho(y\!x_{<n}) \qquad (33)$$

The strict monotony of $\varphi$ and the semimeasure property of $\rho$ have been used. By the induction hypothesis $\lim_{t\to\infty}\phi(y\!x_{<n},t)\ge\rho(y\!x_{<n})$ and (33), for sufficiently large $t$ we have $\phi(y\!x_{<n},t)>\sum_{x_n}\varphi'(y\!x_{1:n},i)$. The condition in (32) is, hence, satisfied and therefore $\phi(y\!x_{1:n},t)\ge\varphi'(y\!x_{1:n},i)$ for sufficiently large $t$, especially $\phi(y\!x_{1:n},\infty)\ge\varphi'(y\!x_{1:n},i)$ for all $i$. Taking the limit $i\to\infty$ we get $\phi(y\!x_{1:n},\infty)\ge\varphi'(y\!x_{1:n},\infty)=\rho(y\!x_{1:n})$.

Combining all results, we have shown that the constructed $\phi(\cdot,t)$ are primitive recursive chronological semimeasures monotone increasing in $t$, which converge to the enumerable chronological semimeasure $\rho$. This finally proves the enumerability of the set of enumerable chronological semimeasures.



Convergence of $\xi^{AI}$ to $\mu^{AI}$: The following inequality has been proved

$$2\sum_i y_i\,(y_i-z_i)^2 \;\le\; \sum_i y_i\ln\frac{y_i}{z_i} \quad\text{with}\quad \sum_i y_i=1,\quad \sum_i z_i\le1 \qquad (34)$$
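A quick numerical sanity check of this inequality on randomly drawn vectors is sketched below. It assumes the reading in which the squared differences on the left are weighted by $y_i$, the $y_i$ form a probability vector, and the $z_i$ are strictly positive with sum at most one. The helper names and the sampling scheme are hypothetical, and passing such a check is of course no substitute for the proof.

```python
import math
import random


def lhs(y, z):
    """2 * sum_i y_i * (y_i - z_i)^2"""
    return 2.0 * sum(yi * (yi - zi) ** 2 for yi, zi in zip(y, z))


def rhs(y, z):
    """sum_i y_i * ln(y_i / z_i); terms with y_i == 0 contribute 0."""
    return sum(yi * math.log(yi / zi) for yi, zi in zip(y, z) if yi > 0.0)


def random_pair(n, rng):
    """Random probability vector y and strictly positive z with sum(z) <= 1."""
    raw_y = [rng.uniform(0.01, 1.0) for _ in range(n)]
    raw_z = [rng.uniform(0.01, 1.0) for _ in range(n)]
    y = [v / sum(raw_y) for v in raw_y]
    scale = rng.uniform(0.5, 1.0)          # total mass of z
    z = [scale * v / sum(raw_z) for v in raw_z]
    return y, z


rng = random.Random(0)
violations = 0
for _ in range(2000):
    y, z = random_pair(rng.randint(2, 6), rng)
    if lhs(y, z) > rhs(y, z) + 1e-12:
        violations += 1
```

For a binary alphabet the left-hand side reduces to $2(y_1-z_1)^2$, which is where the factor $\frac12$ in the binary bound comes from.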

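If we identify $i=x_k$ and $y_i=\mu(y\!x_{<k}y\underline x_k)$ and $z_i=\xi(y\!x_{<k}y\underline x_k)$, multiply both sides with $\mu(y\underline x_{<k})$, take the sum over $x_{<k}$, then the sum over $k$, and use Bayes' rule $\mu(y\underline x_{<k})\cdot\mu(y\!x_{<k}y\underline x_k)=\mu(y\underline x_{1:k})$, we get

$$2\sum_{k=1}^{n}\sum_{x_{1:k}}\mu(y\underline x_{1:k})\big(\mu(y\!x_{<k}y\underline x_k)-\xi(y\!x_{<k}y\underline x_k)\big)^2 \;\le\; \sum_{k=1}^{n}\sum_{x_{1:k}}\mu(y\underline x_{1:k})\ln\frac{\mu(y\!x_{<k}y\underline x_k)}{\xi(y\!x_{<k}y\underline x_k)} \qquad (35)$$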



In the r.h.s. we can replace $\sum_{x_{1:k}}\mu(y\underline x_{1:k})$ by $\sum_{x_{1:n}}\mu(y\underline x_{1:n})$ as the argument of the logarithm is independent of $x_{k+1:n}$. The $k$ sum can now be brought into the logarithm and converts to a product. Using Bayes' rule for $\mu$ and $\xi$ we get

$$\dots \;=\; \sum_{x_{1:n}}\mu(y\underline x_{1:n})\ln\prod_{k=1}^{n}\frac{\mu(y\!x_{<k}y\underline x_k)}{\xi(y\!x_{<k}y\underline x_k)} \;=\; \sum_{x_{1:n}}\mu(y\underline x_{1:n})\ln\frac{\mu(y\underline x_{1:n})}{\xi(y\underline x_{1:n})} \;\overset{+}{<}\; \ln2\cdot K(\mu) \qquad (36)$$

where we have used the universality property (28) of $\xi$ in the last step. The main complication for generalizing (20) to (35,36) was the generalization of (34) from $|X|=2$ to a general alphabet; the $y$ are, again, pure spectators. This will change when we analyze error/credit bounds analogous to (22).

^13 Semimeasures are, by definition, total functions.

(35,36) shows that the expected squared difference of $\mu$ and $\xi$ is finite for computable $\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\underline x_k)$ converges to $\mu(y\!x_{<k}y\underline x_k)$ for $k\to\infty$ with $\mu$ probability 1. If we take a finite product of $\xi$'s and use Bayes' rule, we see that also $\xi(y\!x_{<k}y\underline x_{k:k+r})$ converges to $\mu(y\!x_{<k}y\underline x_{k:k+r})$. More generally, in case of a bounded horizon $h_k$, it follows that

$$\xi(y\!x_{<k}\,y\underline x_{k:m_k}) \;\xrightarrow{k\to\infty}\; \mu(y\!x_{<k}\,y\underline x_{k:m_k}) \quad\text{if}\quad h_k\equiv m_k-k+1\le h_{max}<\infty \qquad (37)$$

This makes us confident that the outputs $\dot y_k$ of the AI$\xi$ model could converge to the outputs $\dot y_k$ of the AI$\mu$ model (9), at least for bounded horizon.

We want to call an AI model universal if it is $\mu$ independent (unbiased, model-free) and is able to solve any solvable problem and learn any learnable task. Further, we call a universal model universally optimal if there is no program which can solve or learn significantly faster (in terms of interaction cycles). As the AI$\xi$ model is parameterless, $\xi$ converges to $\mu$ (37), the AI$\mu$ model is itself optimal, and we expect no other model to converge faster to AI$\mu$ by analogy to SP (22),

we expect AI$\xi$ to be universally optimal.

This is our main claim. In a sense, the intention of the remaining (sub)sections is to define this statement more rigorously and to give further support.

Intelligence order relation: We define the $\xi$ expected credit in cycles $k$ to $m$ of a policy $p$ similar to the $\mu$ expected credit and (23). We extend the definition to programs $p\notin\dot P_k$ which are not consistent with the current history.

$$C^\xi_{km}(p|\dot y\dot x_{<k}) \;:=\; \frac1N\sum_{q:q(\dot y_{<k})=\dot x_{<k}} 2^{-l(q)}\cdot C_{km}(\tilde p,q) \qquad (38)$$

The normalization $N$ is again only necessary for interpreting $C_{km}$ as the expected credit but otherwise unneeded. For consistent policies $p\in\dot P_k$ we define $\tilde p:=p$. For $p\notin\dot P_k$, $\tilde p$ is a modification of $p$ in such a way that its output is consistent with the current history $\dot y\dot x_{<k}$, hence $\tilde p\in\dot P_k$, but unaltered for the current and future cycles $\ge k$. Using this definition of $C_{km}$ we could take the maximum over all systems $p$ in (23), rather than only the consistent ones.

We call $p$ more or equally intelligent than $p'$ if

$$p\succeq p' \;:\Leftrightarrow\; \forall k\,\forall\dot y\dot x_{<k} : C^\xi_{km_k}(p|\dot y\dot x_{<k}) \;\ge\; C^\xi_{km_k}(p'|\dot y\dot x_{<k}) \qquad (39)$$

i.e. if $p$ yields in any circumstance higher $\xi$ expected credit than $p'$. As the algorithm $p^*$ behind the AI$\xi$ system maximizes $C^\xi_{km_k}$ we have $p^*\succeq p$ for all $p$. The AI$\xi$ model is hence the most intelligent system w.r.t. $\succeq$. $\succeq$ is a universal order relation in the sense that it is free of any parameters (except $m_k$) or specific assumptions about the environment. A proof that $\succeq$ is a reliable intelligence order (which we believe to be true) would prove that AI$\xi$ is universally optimal. We could further ask: how useful is $\succeq$ for ordering policies of practical interest with intermediate intelligence, or how can $\succeq$ help to guide toward constructing more intelligent systems with reasonable computation time? An effective intelligence order relation will be defined in section 10, which is more useful from a practical point of view.



Credit bounds and separability concepts: The credits $C_{km}$ associated with the AI systems correspond roughly to the negative error measure $-E_{n\rho}$ of the SP systems. In SP, we were interested in small bounds for the error excess $E_{n\Theta_\xi}-E_{n\rho}$. Unfortunately, simple credit bounds for AI$\xi$ in terms of $C_{km}$ analogous to the error bound (22) do not hold. We even have difficulties in specifying what we can expect to hold for AI$\xi$ or any AI system which claims to be universally optimal. Consequently, we cannot have a proof if we don't know what to prove. In SP, the only important property of $\mu$ for proving error bounds was its complexity $K(\mu)$. We will see that in the AI case, there are no useful bounds in terms of $K(\mu)$ only. We either have to study restricted problem classes or consider bounds depending on other properties of $\mu$, rather than on its complexity only. In the following, we will exhibit the difficulties by two examples and introduce concepts which may be useful for proving credit bounds. Despite the difficulties in even claiming useful credit bounds, we nevertheless firmly believe that the order relation (39) correctly formalizes the intuitive meaning of intelligence and, hence, that the AI$\xi$ system is universally optimal.

In the following, we choose $m_k = T$. We want to compare the true, i.e. $\mu$ expected, credit $C^\mu_{1T}$ of a $\mu$ independent universal policy $p^{best}$ with any other policy $p$. Naively, we might expect the existence of a policy $p^{best}$ which maximizes $C^\mu_{1T}$, apart from additive corrections of lower order for $T \to \infty$

$$C^\mu_{1T}(p^{best}) \geq C^\mu_{1T}(p) - o(...) \quad \forall \mu, p \qquad (40)$$

Note that the policy $p^{*\xi}$ of the AI$\xi$ system maximizes $C^\xi_{1T}$ by definition ($p^{*\xi} \succeq p$). As $C^\xi_{1T}$ is thought to be a guess of $C^\mu_{1T}$, we might expect $p^{best} = p^{*\xi}$ to approximately maximize $C^\mu_{1T}$, i.e. (40) to hold. Let us consider the problem class (set of environments) $\{\mu_0, \mu_1\}$ with $Y = C = \{0,1\}$ and $c_k = \delta_{i y_1}$ in environment $\mu_i$. The first output $y_1$ decides whether you go to heaven with all future credits being 1 (good) or to hell with all future credits being 0 (bad). It is clear that if $\mu_i$, i.e. $i$, is known, the optimal policy $p^{*\mu_i}$ is to output $y_1 = i$ in the first cycle with $C^\mu_{1T}(p^{*\mu_i}) = T$. On the other hand, any unbiased policy $p^{best}$ independent of the actual $\mu$ either outputs $y_1 = 1$ or $y_1 = 0$. Independent of the actual choice $y_1$, there is always an environment ($\mu = \mu_{1-y_1}$) for which this choice is catastrophic ($C^\mu_{1T}(p^{best}) = 0$). No single system can perform well in both environments $\mu_0$ and $\mu_1$. The r.h.s. of (40) equals $T - o(T)$ for $p = p^{*\mu}$. For all $p^{best}$ there is a $\mu$ for which the l.h.s. is zero. We have shown that no $p^{best}$ can satisfy (40) for all $\mu$ and $p$, so we cannot expect $p^{*\xi}$ to do so. Nevertheless, there are problem classes for which (40) holds, for instance SP and CF. For SP, (40) is just a reformulation of (22) with an appropriate choice for $p^{best}$ (which differs from $p^{*\xi}$, see next section). We expect (40) to hold for all inductive problems in which the environment is not influenced by the output of the system (apart from the credit feedback, see the footnote). We want to call these $\mu$ passive or inductive environments. Further, we want to call $\mu$ satisfying (40) with $p^{best} = p^{*\xi}$ pseudo passive. So we expect inductive $\mu$ to be pseudo passive.
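The Heaven&Hell argument can be checked mechanically. The following is a minimal sketch, assuming the two environments are encoded by their index $i \in \{0,1\}$ and a policy is reduced to its first output; the function names `credit` and `worst_case_credit` are illustrative and do not appear in the paper.

```python
def credit(env_index: int, first_output: int, lifetime: int) -> int:
    """Total credit over `lifetime` cycles in environment mu_i.

    In mu_i every cycle pays credit 1 if the first output equals i, else 0.
    """
    return lifetime if first_output == env_index else 0


def worst_case_credit(first_output: int, lifetime: int) -> int:
    """Minimum total credit over both environments for a policy that commits
    to `first_output` without knowing which environment it faces."""
    return min(credit(i, first_output, lifetime) for i in (0, 1))


T = 100
# An informed policy (knows i) always collects T.
informed = [credit(i, i, T) for i in (0, 1)]
# Any uninformed first output is catastrophic in one of the two environments.
uninformed_worst = {y1: worst_case_credit(y1, T) for y1 in (0, 1)}
```

Whichever first output is chosen, the worst case over the two environments is zero, while the informed policy obtains $T$ in both, which is the gap that rules out (40) for this class.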

Let us give a further example to demonstrate the difficulties in establishing credit bounds. Let $C = \{0,1\}$ and $|Y|$ be large. We consider all (deterministic) environments in which a single complex output $y^*$ is correct ($c = 1$) and all others are wrong ($c = 0$). The problem class $M$ is defined by

$$M := \{\mu : \mu(yx_{<k}y_k\underline{1}) = \delta_{y_k y^*},\; y^* \in Y,\; K(\mu) \approx \log_2 |Y| \}$$

There are $N \approx |Y|$ such $y^*$. The only way a $\mu$ independent policy $p$ can find the correct $y^*$ is by trying one $y$ after the other in a certain order. In the first $N-1$ cycles, at most $N-1$ different $y$ are tested. As there are $N$ different possible $y^*$, there is always a $\mu \in M$ for which $p$ gives erroneous outputs in the first $N-1$ cycles. The number of errors is $E^{p}_{\infty} \geq N-1 \approx |Y| = 2^{\log_2 |Y|} \approx 2^{K(\mu)}$ for this $\mu$. As this is true for any $p$, it is also true for the AI$\xi$ model, hence $E^{AI\xi}_{\infty} \leq 2^{K(\mu)}$ is the best possible error bound we can expect which depends on $K(\mu)$ only. Actually, we will derive such a bound in section 5 for SP. Unfortunately, as we are mainly interested in the cycle region $k \ll |Y| \approx 2^{K(\mu)}$ (discussed elsewhere in this paper), this bound is trivial. There are no interesting bounds depending on $K(\mu)$ only, unlike the SP case for deterministic $\mu$. Bounds must either depend on additional properties of $\mu$ or we have to consider specialized bounds for restricted problem classes. The case of probabilistic $\mu$ is similar. Whereas for SP there are useful bounds in terms of $E_{k\mu}$ and $K(\mu)$, there are no such bounds for AI$\xi$. Again, this is not a drawback of AI$\xi$, since for no unbiased AI system could the errors/credits be bounded in terms of $K(\mu)$ and the errors/credits of AI$\mu$ only.

There is a way to make use of gross (e.g. $2^{K(\mu)}$) bounds. Assume that after a reasonable number of cycles $k$, the information $\dot x_{<k}$ perceived by the AI$\xi$ system contains a lot of information about the true environment $\mu$. The information in $\dot x_{<k}$ might be coded in any form. Let us assume that the complexity $K(\mu|\dot x_{<k})$ of $\mu$ under the condition that $\dot x_{<k}$ is known, is of order 1. Consider a theorem bounding the sum of credits or of other quantities over cycles $1...\infty$ in terms of $f(K(\mu))$ for a function $f$ with $f(O(1)) = O(1)$, like $f(n) = 2^n$. Then there will be a bound for cycles $k...\infty$ in terms of $f(K(\mu|\dot x_{<k})) = O(1)$. Hence, a bound like $2^{K(\mu)}$ can be replaced by the small bound $2^{K(\mu|\dot x_{<k})} = O(1)$ after a reasonable number of cycles. All one has to show/ensure/assume is that enough information about $\mu$ is presented (in any form) in the first $k$ cycles. In this way, even a gross bound could become useful. In a later section we use a similar argument to prove that AI$\xi$ is able to learn supervised.

In the following, we weaken (40) in the hope of getting a bound applicable to wider problem classes than the passive one. Consider the I/O sequence $\dot y_1\dot x_1...\dot y_n\dot x_n$ caused by AI$\xi$. On history $\dot y\dot x_{<k}$, AI$\xi$ will output $\dot y_k \equiv \dot y^\xi_k$ in cycle $k$. Let us compare this to what AI$\mu$ would output, still on the same history $\dot y\dot x_{<k}$ produced by AI$\xi$. As AI$\mu$ maximizes the $\mu$ expected credit, AI$\xi$ causes lower (or at best equal) $C^\mu_{km_k}$ if $\dot y^\xi_k$ differs from $\dot y^\mu_k$. Let $D_{n\mu\xi} := \langle \sum_{k=1}^{n} 1 - \delta_{\dot y^\mu_k, \dot y^\xi_k} \rangle_\mu$ be the $\mu$ expected number of suboptimal choices of AI$\xi$, i.e. outputs different from AI$\mu$ in the first $n$ cycles. One might weigh the deviating cases by their severity. Especially when the $\mu$ expected credits $C^\mu_{km_k}$ for $\dot y^\xi_k$ and $\dot y^\mu_k$ are equal or close to each other, this should be taken into account in the definition of $D_{n\mu\xi}$. These details do not matter in the following qualitative discussion. The important difference to (40) is that here we stick to the history produced by AI$\xi$ and count a wrong decision as, at most, one error. The wrong decision in the Heaven&Hell example in the first cycle no longer counts as losing $T$ credits, but counts as one wrong decision. In a sense, this is fairer. One shouldn't blame somebody too much who makes a single wrong decision for which he just has too little information available to make a correct decision. The AI$\xi$ model would deserve to be called asymptotically optimal if the probability of making a wrong decision tends to zero, i.e. if

$$D_{n\mu\xi}/n \to 0 \quad\text{for}\quad n \to \infty, \quad\text{i.e.}\quad D_{n\mu\xi} = o(n). \qquad (41)$$

Footnote (to the definition of passive environments): Of course, the credit feedback $c_k$ depends on the system's output. What we have in mind is that, as in sequence prediction, the true sequence is not influenced by the system.

We say that $\mu$ can be asymptotically learned (by AI$\xi$) if (41) is satisfied. We claim that AI$\xi$ (for $m_k \to \infty$) can asymptotically learn every problem $\mu$ of relevance, i.e. AI$\xi$ is asymptotically optimal. We included the qualifier "of relevance", as we are not sure whether there could be strange $\mu$ spoiling (41), but we expect those $\mu$ to be irrelevant from the perspective of AI. In the field of Learning, there are many asymptotic learnability theorems, often not too difficult to prove. So a proof of (41) might also be accessible. Unfortunately, asymptotic learnability theorems are often too weak to be useful from a practical point of view. Nevertheless, they point in the right direction.

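From the convergence (37) of $\xi \to \mu$ we might expect $C^\xi_{km_k} \to C^\mu_{km_k}$ and hence $\dot y^\xi_k$, defined by the AI$\xi$ model, to converge to $\dot y^\mu_k$, defined by the AI$\mu$ model, with $\mu$ probability 1 for $k \to \infty$. The first problem is that if the $C^\mu_{km_k}$ for the different choices of $y_k$ are nearly equal, then even if $C^\xi_{km_k} \approx C^\mu_{km_k}$, $\dot y^\xi_k \neq \dot y^\mu_k$ is possible due to the non-continuity of $\mathrm{maxarg}_{y_k}$. This can be cured by a weighted $D_{n\mu\xi}$ as described above. More serious is the second problem, which we explain for $h_k = 1$ and $X = C = \{0,1\}$. For $\dot y^\xi_k = \mathrm{maxarg}_{y_k}\, \xi(\dot y\dot x_{<k}y_k\underline{1})$ to converge to $\dot y^\mu_k = \mathrm{maxarg}_{y_k}\, \mu(\dot y\dot x_{<k}y_k\underline{1})$, it is not sufficient to know that $\xi(\dot y\dot x_{<k}\dot y\underline{\dot x}_k) \to \mu(\dot y\dot x_{<k}\dot y\underline{\dot x}_k)$, as has been proved in (37). We need convergence not only for the true output $\dot y_k$ and credit $\dot c_k$, but also for alternate outputs $y_k$ and credit 1. $\dot y^\xi_k$ converges to $\dot y^\mu_k$ if $\xi$ converges uniformly to $\mu$, i.e. if in addition to (37)

$$|\xi(\dot y\dot x_{<k}y'_k\underline{x}'_k) - \mu(\dot y\dot x_{<k}y'_k\underline{x}'_k)| < c \cdot |\xi(\dot y\dot x_{<k}\dot y\underline{\dot x}_k) - \mu(\dot y\dot x_{<k}\dot y\underline{\dot x}_k)| \quad \forall y'_k x'_k \qquad (42)$$

holds for some constant $c$ (at least in some $\mu$ expected sense). We call $\mu$ satisfying (42) uniform. For uniform $\mu$ one can show (41) with appropriately weighted $D_{n\mu\xi}$ and bounded horizon $h_k < h_{max}$. Unfortunately there are relevant $\mu$ which are not uniform. Details will be given elsewhere.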

In the following, we briefly mention some further concepts. A Markovian $\mu$ is defined as depending only on the last output, i.e. $\mu(yx_{<k}y\underline{x}_k) = \mu_k(y\underline{x}_k)$. We say $\mu$ is generalized Markovian if $\mu(yx_{<k}y\underline{x}_k) = \mu_k(yx_{k-l:k-1}y\underline{x}_k)$ for fixed $l$. This property has some similarities to the factorizable $\mu$ defined earlier. If further $\mu_k \equiv \mu_1\ \forall k$, $\mu$ is called stationary. Further, for all enumerable $\mu$, $\mu(yx_{<k}y\underline{x}_k)$ and $\xi(yx_{<k}y\underline{x}_k)$ become independent of $yx_{<l}$ for fixed $l$ and $k \to \infty$ with $\mu$ probability 1. This property, which we want to call forgetfulness, will be proved elsewhere. Further, we say $\mu$ is farsighted if $\lim_{m_k \to \infty} \dot y^{(m_k)}_k$ exists. More details will be given in the next subsection, where we also give an example of a possibly relevant $\mu$ which is not farsighted.

We have introduced several concepts which might be useful for proving credit bounds, including forgetful, relevant, asymptotically learnable, farsighted, uniform, (generalized) Markovian, factorizable and (pseudo) passive $\mu$. We have sorted them here approximately in the order of decreasing generality. We want to call them separability concepts. The more general (like relevant, asymptotically learnable and farsighted) $\mu$ will be called weakly separable, the more restrictive (like (pseudo) passive and factorizable) $\mu$ will be called strongly separable, but we will use these qualifiers in a more qualitative, rather than rigid, sense. Other (non-separability) concepts are deterministic $\mu$ and, of course, the class of all chronological $\mu$.

The choice of the horizon: The only significant arbitrariness in the AI$\xi$ model lies in the choice of the horizon function $h_k \equiv m_k - k + 1$. We discuss some choices which seem to be natural and give preliminary conclusions at the end. We will not discuss ad hoc choices of $h_k$ for specific problems (like the discussion in section 6 in the context of finite games). We are interested in universal choices of $m_k$.

If the lifetime of the system is known to be $T$, which is in practice always large but finite, then the choice $m_k = T$ correctly maximizes the expected future credit. $T$ is usually not known in advance, as in many cases the time we are willing to run a system depends on the quality of its outputs. For this reason, it is often desirable that good outputs are not delayed too much if this results in a marginal credit increase only. This can be incorporated by damping the future credits. If, for instance, we assume that the survival of the system in each cycle is proportional to the past credit, an exponential damping $c_k := c'_k \cdot e^{-\lambda k}$ is appropriate, where the $c'_k$ are bounded, e.g. $c'_k \in [0,1]$. The expression defining the AI$\xi$ output converges for $m_k \to \infty$ in this case. But this does not solve the problem, as we introduced a new arbitrary time-scale $1/\lambda$. Every damping introduces a time-scale.

Even the time-scale invariant damping factor $c_k = c'_k \cdot k^{-\alpha}$ introduces a dynamic time-scale. In cycle $k$ the contribution of cycle $2^{1/\alpha} \cdot k$ is damped by a factor $1/2$. The effective horizon $h_k$ in this case is $\sim k$. The choice $h_k = \beta \cdot k$ with $\beta \sim 2^{1/\alpha}$ qualitatively models the same behaviour. We have not introduced an arbitrary time-scale $T$, but limited the farsightedness to some multiple (or fraction) of the length of the current history. This avoids the pre-selection of a global time-scale $T$ or $1/\lambda$. This choice has some appeal, as it seems that humans of age $k$ years usually do not plan their lives for more than, perhaps, the next $k$ years ($\beta_{human} = 1$). From a practical point of view this model might serve all needs, but from a theoretical point of view we feel uncomfortable with such a limitation in the horizon from the very beginning. Note that we have to choose $\beta = O(1)$, because otherwise we would again introduce a number $\beta$ which has to be justified.
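The difference between the two dampings can be made concrete with a small calculation. The sketch below assumes a power-law damping $k^{-\alpha}$ for the time-scale invariant case (an assumption chosen to match the half-damping cycle $2^{1/\alpha} \cdot k$ mentioned above) and $e^{-\lambda k}$ for the exponential case; the function names are illustrative only.

```python
import math


def half_damping_cycle_power(k: float, alpha: float) -> float:
    """Cycle k' at which a power-law damping k'^(-alpha) has dropped to half
    of its value at cycle k. Solving k'^(-alpha) = 0.5 * k^(-alpha) gives
    k' = 2^(1/alpha) * k, i.e. the delay grows in proportion to k."""
    return 2.0 ** (1.0 / alpha) * k


def half_damping_delay_exponential(lam: float) -> float:
    """Number of cycles after which an exponential damping exp(-lam * k) has
    halved. The delay ln(2)/lam is the same for every k: a fixed time-scale."""
    return math.log(2.0) / lam


alpha = 0.5
for k in (10, 100, 1000):
    k_half = half_damping_cycle_power(k, alpha)
    # The ratio k_half / k is the constant 2^(1/alpha), independent of k.
    print(k, k_half, k_half / k)

print(half_damping_delay_exponential(0.01))
```

Under these assumptions the power-law case yields an effective horizon proportional to the current cycle number, whereas the exponential case yields the same horizon at every cycle.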

The naive limit $m_k \to \infty$ in the definition of the AI$\xi$ output may turn out to be well defined and the previous discussion superfluous. In the following, we define a limit which is always well defined (for finite $|Y|$). Let $\dot y^{(m)}_k$ be defined as the AI$\xi$ output with $m_k$ replaced by $m$. Further, let $\dot Y^{(m)}_k := \{\dot y^{(m_k)}_k : m_k \geq m\}$ be the set of outputs in cycle $k$ for the choices $m_k = m, m+1, m+2, ...$. Because $\dot Y^{(m)}_k \supseteq \dot Y^{(m+1)}_k \neq \{\}$, we have $\dot Y^{(\infty)}_k := \bigcap_{m=k}^{\infty} \dot Y^{(m)}_k \neq \{\}$. We define the $m_k = \infty$ model to output any $\dot y^{(\infty)}_k \in \dot Y^{(\infty)}_k$. This is the best output consistent with any choice of $m_k$, especially $m_k \to \infty$. Choosing the lexicographically smallest $\dot y^{(\infty)}_k \in \dot Y^{(\infty)}_k$ would correspond to the limes inferior $\liminf_{m \to \infty} \dot y^{(m)}_k$. $\dot y^{(\infty)}_k$ is unique, i.e. $|\dot Y^{(\infty)}_k| = 1$, iff the naive limit $\lim_{m \to \infty} \dot y^{(m)}_k$ exists. Note that the limit $\lim_{m_k \to \infty} C^*_{km_k}(yx_{<k})$ need not exist for this construction.
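The nested-set construction can be illustrated on a finite range of horizons. This is only a sketch under stated assumptions: the true construction intersects infinitely many sets, whereas here the intersection is approximated by the outputs still chosen in the upper half of a finite horizon range, and the two horizon-to-output maps (`converging`, `oscillating`) are hypothetical stand-ins for a real agent.

```python
from typing import Callable, Set


def outputs_from(output_for_horizon: Callable[[int], int], m: int, m_max: int) -> Set[int]:
    """Finite stand-in for the set of outputs chosen for horizons m, m+1, ..., m_max."""
    return {output_for_horizon(h) for h in range(m, m_max + 1)}


def late_outputs(output_for_horizon: Callable[[int], int], m_max: int) -> Set[int]:
    """Truncated proxy for the infinite intersection: outputs that are still
    chosen for horizons in the upper half of the range."""
    return outputs_from(output_for_horizon, m_max // 2, m_max)


def converging(m: int) -> int:
    """Hypothetical agent whose output settles to 1 once the horizon is at least 5."""
    return 0 if m < 5 else 1


def oscillating(m: int) -> int:
    """Hypothetical agent whose output keeps alternating with the horizon."""
    return m % 2


M = 40
limit_converging = late_outputs(converging, M)    # a single element: the naive limit exists
limit_oscillating = late_outputs(oscillating, M)  # two elements: the naive limit does not exist
smallest = min(limit_oscillating)                 # lexicographically smallest choice (liminf)
```

In the converging case the proxy set has one element, matching the statement that uniqueness is equivalent to existence of the naive limit; in the oscillating case the set keeps both outputs and the smallest one plays the role of the limes inferior.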

The construction above leads to a mathematically elegant, no-parameter AI$\xi$ model. Unfortunately, this is not the end of the story. The limit $m_k \to \infty$ can cause undesirable results in the AI$\mu$ model for special $\mu$, which might also happen in the AI$\xi$ model however we define the limit $m_k \to \infty$. Consider $Y = C = \{0,1\}$ and $X' = \{\}$. Output $y_k = 0$ shall give credit $c_k = 0$; output $y_k = 1$ shall give $c_k = 1$ iff $\dot y_{k-l-\sqrt{l}} ... \dot y_{k-l} = 0...0$ for some $l$. I.e. the system can achieve $l$ consecutive positive credits if there was a sequence of length at least $\sqrt{l}$ with $y_k = c_k = 0$. If the lifetime of the AI$\mu$ system is $T$, it outputs $\dot y_k = 0$ in the first $r$ cycles and then $\dot y_k = 1$ for the remaining cycles, with $r$ chosen such that both phases together fill the lifetime $T$. This will lead to the highest possible total credit, $C_{1T} = \sqrt{T + 1/4} - 1/2$. Any fragmentation of the 0 and 1 sequences would reduce this. For $T \to \infty$ the AI$\mu$ system can and will delay the point $r$ of switching to $\dot y_k = 1$ indefinitely and always output 0 with total credit 0, obviously the worst possible behaviour. The AI$\xi$ system will explore the above rule after a while of trying $y_k = 0/1$ and then apply the same behaviour as the AI$\mu$ system, since the simplest rules covering past data dominate $\xi$. For finite $T$ this is exactly what we want, but for infinite $T$ the AI$\xi$ model fails just as the AI$\mu$ model does. The good point is that this is not a weakness of the AI$\xi$ model, as AI$\mu$ fails too and no system can be better than AI$\mu$. The bad point is that $m_k \to \infty$ has far reaching consequences, even when starting from an already very large $m_k = T$. The reason is that the $\mu$ of this example is highly non-local in time, i.e. it may violate one of our weak separability conditions.

In the last paragraph we have considered the consequences of $m_k \to \infty$ in the AI$\mu$ model. We now consider whether the AI$\xi$ model is a good approximation of the AI$\mu$ model for large $m_k$. Another objection against too large choices of $m_k$ is that $\xi(yx_{<k}y\underline{x}_{k:m_k})$ has been proved to be a good approximation of $\mu(yx_{<k}y\underline{x}_{k:m_k})$ only for $k \gg h_k$, which is never satisfied for $m_k = T$ or $m_k = \infty$. We have seen that, for factorizable $\mu$, the limit $h_k \to \infty$ causes no problem, as from a certain $h_k$ on the output $\dot y_k$ is independent of $h_k$. As $\xi \to \mu$ for bounded $h_k$, $\xi$ will develop this separability property too. So, from a certain $k_0$ on, the limit $h_k \to \infty$ might also be safe for $\xi$. Therefore, taking the limit from the very beginning worsens the behaviour of AI$\xi$ maybe only for finitely many cycles $k < k_0$, which would be acceptable. We suppose that the valuations $c_{k'}$ for $k' \gg k$, where $\xi$ can no longer be trusted as a good approximation to $\mu$, are in some sense randomly disturbed with decreasing influence on the choice of $\dot y_k$. This claim is supported by the forgetfulness property of $\xi$.

We are not sure whether the choice of $m_k$ is of marginal importance, as long as $m_k$ is chosen sufficiently large and of low complexity, $m_k = 2^{2^{16}}$ for instance, or whether the choice of $m_k$ will turn out to be a central topic for the AI$\xi$ model or for the planning aspect of any AI system in general. We suppose that the limit $m_k \to \infty$ for the AI$\xi$ model results in correct behaviour for weakly separable $\mu$ and that even the naive limit exists, but to prove this would probably give interesting insights.
5 Sequence Prediction (SP)

We have introduced the AI$\xi$ model as a unification of the ideas of decision theory and universal probability distribution. We might expect AI$\xi$ to behave identically to SP$\Theta_\xi$ when faced with a sequence prediction problem, but things are not that simple, as we will see.

Using the AI$\mu$ Model for Sequence Prediction: We have seen in the last section how to predict sequences for known and unknown prior distribution $\mu^{SP}$. Here we consider binary sequences $z_1z_2z_3... \in \mathbb{B}^\infty$ (see the footnote on the symbol $z_k$) with known prior probability $\mu^{SP}(\underline{z_1z_2z_3...})$.

We want to show how the AI$\mu$ model can be used for sequence prediction. We will see that it gives the same prediction as the SP$\Theta_\mu$ system. First, we have to specify how the AI$\mu$ model should be used for sequence prediction. The following choice is natural:

The system's output $y_k$ is interpreted as a prediction for the $k$-th bit $z_k$ of the string which has to be predicted. This means that $y_k$ is binary ($y_k \in \mathbb{B} =: Y$). As a reaction of the environment, the system receives credit $c_k = 1$ if the prediction was correct ($y_k = z_k$), or $c_k = 0$ if the prediction was erroneous ($y_k \neq z_k$). The question is what the input $x'_k$ of the next cycle should be. One choice would be to inform the system about the correct $k$-th bit of the string and set $x'_k = z_k$. But as the true bit $z_k = \delta_{y_kc_k}$ can be inferred from the credit $c_k$ in conjunction with the prediction $y_k$, this information is redundant. $\delta$ is the Kronecker symbol, defined as $\delta_{ab} = 1$ for $a = b$ and 0 otherwise. There is no need for this additional feedback. So we set $x'_k = \epsilon \in X' = \{\epsilon\}$, thus having $x_k \equiv c_k$. The system's performance does not change when we include this redundant information; it merely complicates the notation. The prior probability $\mu^{AI}$ of the AI$\mu$ model is

$$\mu^{AI}(y_1\underline{x}_1...y_k\underline{x}_k) = \mu^{AI}(y_1\underline{c}_1...y_k\underline{c}_k) = \mu^{SP}(\underline{\delta_{y_1c_1}...\delta_{y_kc_k}}) = \mu^{SP}(\underline{z_1...z_k}) \qquad (43)$$
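That the credit together with the prediction determines the true bit can be verified by enumerating the four binary cases. The following minimal sketch uses illustrative function names that are not part of the paper.

```python
def kronecker(a: int, b: int) -> int:
    """Kronecker symbol: 1 if a == b, else 0."""
    return 1 if a == b else 0


def credit_for(prediction: int, true_bit: int) -> int:
    """Credit c_k: 1 for a correct prediction, 0 for an erroneous one."""
    return kronecker(prediction, true_bit)


def recover_bit(prediction: int, credit: int) -> int:
    """Recover the true bit from prediction and credit via z_k = delta(y_k, c_k)."""
    return kronecker(prediction, credit)


# Enumerate all combinations of prediction y and true bit z.
table = [(y, z, credit_for(y, z), recover_bit(y, credit_for(y, z)))
         for y in (0, 1) for z in (0, 1)]
for y, z, c, z_recovered in table:
    print(f"y={y} z={z} c={c} recovered z={z_recovered}")
```

In every row the recovered bit equals the true bit, which is why feeding $z_k$ back to the system as an extra input would be redundant.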

In the following, we will drop the superscripts of $\mu$ because they are clear from the arguments of $\mu$ and the $\mu$ are equal in any case.

The recursive formula for the expected credit reduces to

$$C^*_{km}(yx_{<k}) = \max_{y_k} \sum_{c_k} [c_k + C^*_{k+1,m}(yx_{1:k})] \cdot \mu(\delta_{y_1c_1}...\delta_{y_{k-1}c_{k-1}}\underline{\delta_{y_kc_k}}) \qquad (44)$$

The first observation we can make is that for this special $\mu$, $C^*_{km}$ only depends on $\delta_{y_ic_i}$, i.e. replacing $y_i$ and $c_i$ simultaneously with their complements does not change the value of $C^*_{km}$. We have a symmetry in $y_ic_i$. For $k = m+1$ this is definitely true, as $C^*_{m+1,m} = 0$ in this case by definition. For $k \leq m$ we prove it by induction. The r.h.s. of (44) is symmetric in $y_ic_i$ for $i < k$ because $\mu$ possesses this symmetry and $C^*_{k+1,m}$ possesses it by induction hypothesis, so the symmetry holds for the l.h.s., which completes the proof. The prediction $\dot y_k$ is

$$\dot y_k = \mathrm{maxarg}_{y_k}\, C^*_{km}(\dot y\dot x_{<k}y_k) = \mathrm{maxarg}_{y_k} \sum_{c_k} [c_k + C^*_{k+1,m}(yx_{1:k})] \cdot \mu(...\underline{\delta_{y_kc_k}}) =$$
$$= \mathrm{maxarg}_{y_k} \sum_{c_k} c_k \cdot \mu(\delta_{\dot y_1\dot c_1}...\underline{\delta_{y_kc_k}}) = \mathrm{maxarg}_{y_k}\, \mu(\dot z_1...\dot z_{k-1}\underline{y_k}) = \mathrm{maxarg}_{z_k}\, \mu(\dot z_1...\dot z_{k-1}\underline{z_k}) \qquad (45)$$

Footnote (to the binary sequences $z_1z_2z_3...$): We use $z_k$ to avoid notational conflicts with the system's inputs $x_k$.

The first equation is the definition of the system's prediction. In the second equation, we have inserted the recursion for $C^*_{km}$, which gives the r.h.s. of (44) with $\max_{y_k}$ replaced by $\mathrm{maxarg}_{y_k}$. $\sum_c f(...\delta_{yc}...)$ is independent of $y$ for any function $f$ depending on the combination $\delta_{yc}$ only. Therefore, the $\sum_{c_k} C^*_{k+1,m}$ term is independent of $y_k$, because $C^*_{k+1,m}$ as well as $\mu$ depend on $\delta_{y_kc_k}$ only. In the third equation, we can therefore drop this term, as adding a constant to the argument of $\mathrm{maxarg}_{y_k}$ does not change the location of the maximum. In the second to last equation we evaluated the $\sum_{c_k}$. Further, if the true credit to $\dot y_i$ is $\dot c_i$, the true $i$-th bit of the string must be $\dot z_i = \delta_{\dot y_i\dot c_i}$. The last equation is just a renaming.

So, the AI$\mu$ model predicts that $z_k$ which has maximal $\mu$ probability, given $\dot z_1...\dot z_{k-1}$. This prediction is independent of the choice of $m_k$. It is exactly the prediction scheme of the deterministic sequence prediction with known prior SP$\Theta_\mu$ described in the last section. As this model was optimal, AI$\mu$ is optimal too, i.e. it has the minimal number of expected errors (maximal expected credit) as compared to any other sequence prediction scheme.

From this, it is already clear that the total expected credit $C_{km}$ must be related to the expected sequence prediction error $E_{m\Theta_\mu}$ (21). Let us prove directly that $C^*_{1m}(\epsilon) + E_{m\Theta_\mu} = m$. We rewrite $C^*_{km}$ in (44) as a function of $z_i$ instead of $y_ic_i$, as it is symmetric in $y_ic_i$. Further, we can pull $C^*_{k+1,m}$ out of the maximization, as it is independent of $y_k$, similar to (45). Renaming the bound variables $y_k$ and $c_k$ we get

$$C^*_{km}(z_{<k}) = \max_{z_k} \mu(z_{<k}\underline{z_k}) + \sum_{z_k} C^*_{k+1,m}(z_{1:k}) \cdot \mu(z_{<k}\underline{z_k}) \qquad (46)$$

Recursively inserting the l.h.s. into the r.h.s. we get

$$C^*_{km}(z_{<k}) = \sum_{i=k}^{m} \sum_{z_{k:i-1}} \max_{z_i} \mu(z_{<k}\underline{z_{k:i}}) \qquad (47)$$

This is most easily proven by induction. For $k = m$ we have $C^*_{mm}(z_{<m}) = \max_{z_m} \mu(z_{<m}\underline{z_m})$ from (46) and $C^*_{m+1,m} = 0$, which equals (47). By induction hypothesis, we assume that (47) is true for $k+1$. Inserting this into (46) we get

$$C^*_{km}(z_{<k}) = \max_{z_k} \mu(z_{<k}\underline{z_k}) + \sum_{z_k} \Big[ \sum_{i=k+1}^{m} \sum_{z_{k+1:i-1}} \max_{z_i} \mu(z_{1:k}\underline{z_{k+1:i}}) \Big] \cdot \mu(z_{<k}\underline{z_k}) =$$
$$= \max_{z_k} \mu(z_{<k}\underline{z_k}) + \sum_{i=k+1}^{m} \sum_{z_{k:i-1}} \max_{z_i} \mu(z_{<k}\underline{z_{k:i}})$$

which equals (47). This was the induction step and hence (47) is proven.

By setting $k = 1$ and slightly reformulating (47), we get the total expected credit in the first $m$ cycles

$$C^*_{1m}(\epsilon) = \sum_{i=1}^{m} \sum_{z_{<i}} \mu(\underline{z_{<i}}) \max\{\mu(z_{<i}\underline{0}), \mu(z_{<i}\underline{1})\} = m - E_{m\Theta_\mu}$$

with $E_{m\Theta_\mu}$ defined in (21).
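The identity between total expected credit and expected errors can be checked numerically on a small example. The sketch below makes two assumptions that are not taken from the paper: the prior `mu_next` is a hypothetical two-state rule chosen only for illustration, and the expected error is taken to be the expected number of cycles in which the more probable bit is not the true bit.

```python
from itertools import product
from typing import Tuple


def mu_next(history: Tuple[int, ...], bit: int) -> float:
    """Hypothetical prior: probability of `bit` as next symbol given `history`."""
    p_one = 0.8 if (history and history[-1] == 1) else 0.3
    return p_one if bit == 1 else 1.0 - p_one


def mu_joint(seq: Tuple[int, ...]) -> float:
    """Probability of the whole prefix `seq` as a product of conditionals."""
    prob = 1.0
    for i, b in enumerate(seq):
        prob *= mu_next(seq[:i], b)
    return prob


def optimal_credit(history: Tuple[int, ...], m: int) -> float:
    """Recursion in the style of (46): probability of the best next-bit guess
    plus the expected optimal credit of the following cycles."""
    k = len(history) + 1
    if k > m:
        return 0.0
    best_now = max(mu_next(history, 0), mu_next(history, 1))
    future = sum(mu_next(history, z) * optimal_credit(history + (z,), m) for z in (0, 1))
    return best_now + future


def expected_errors(m: int) -> float:
    """Expected number of wrong predictions when always guessing the likelier bit."""
    total = 0.0
    for i in range(1, m + 1):
        for prefix in product((0, 1), repeat=i - 1):
            best = max(mu_next(prefix, 0), mu_next(prefix, 1))
            total += mu_joint(prefix) * (1.0 - best)
    return total


m = 6
credit_total = optimal_credit((), m)
errors_total = expected_errors(m)
print(credit_total, errors_total, credit_total + errors_total)
```

For this toy prior the two quantities sum to $m$ up to floating point rounding, as the closed form predicts, because in each cycle the probability of a correct guess and the probability of an error add up to one.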
Using the AI$\xi$ Model for Sequence Prediction: Now we want to use the universal AI$\xi$ model instead of AI$\mu$ for sequence prediction and try to derive error bounds analogous to (22). As in the AI$\mu$ case, the system's output $y_k$ in cycle $k$ is interpreted as a prediction for the $k$-th bit $z_k$ of the string which has to be predicted. The credit is $c_k = \delta_{y_kz_k}$ and there are no other inputs, $x'_k = \epsilon$. What makes the analysis more difficult is that $\xi$ is not symmetric in $y_ic_i \leftrightarrow (1-y_i)(1-c_i)$ and (43) does not hold for $\xi$. On the other hand, $\xi^{AI}$ converges to $\mu^{AI}$ in the limit (37), and (43) should hold asymptotically for $\xi$ in some sense. So we expect that everything proven for AI$\mu$ holds approximately for AI$\xi$. The AI$\xi$ model should behave similarly to SP$\Theta_\xi$, the deterministic variant of Solomonoff prediction. In particular, we expect error bounds similar to (22). Making this rigorous seems difficult. Some general remarks have been made in the last section.

Here we concentrate on the special case of a deterministic computable environment, i.e. the environment is a sequence $\dot z = \dot z_1\dot z_2...$ with $K(\dot z_1...\dot z_n*) \leq K(\dot z) < \infty$. Furthermore, we only consider the simplest horizon model $m_k = k$, i.e. we maximize only the next credit. This is sufficient for sequence prediction, as the credit of cycle $k$ only depends on output $y_k$ and not on earlier decisions. This choice is in no way sufficient and satisfactory for the full AI$\xi$ model, as one single choice of $m_k$ should serve for all AI problem classes. So AI$\xi$ should allow good sequence prediction for some universal choice of $m_k$ and not only for $m_k = k$, which definitely does not suffice for more complicated AI problems. The analysis of this general case is a challenge for the future. For $m_k = k$ the AI$\xi$ model with $x'_i = \epsilon$ reduces to

$$\dot y_k = \mathrm{maxarg}_{y_k} \sum_{c_k} c_k \cdot \xi(\dot y\dot x_{<k}y\underline{c}_k) = \mathrm{maxarg}_{y_k}\, \xi(\dot y\dot x_{<k}y_k\underline{1}) = \mathrm{maxarg}_{y_k}\, \xi(\dot y\dot c_{<k}y_k\underline{1}) \qquad (48)$$

The environmental response $\dot c_k$ is given by $\delta_{\dot y_k\dot z_k}$; it is 1 for a correct prediction ($\dot y_k = \dot z_k$) and 0 otherwise. In the following, we want to bound the number of errors this prediction scheme makes. We need the following inequality

$$\xi(y_1\underline{c}_1...y_k\underline{c}_k) \geq 2^{-K(\delta_{y_1c_1}...\delta_{y_kc_k}*) - O(1)} \qquad (49)$$

We have to find a short program in the sum defining $\xi$ which calculates $c_1...c_k$ from $y_1...y_k$. If we knew $z_i := \delta_{y_ic_i}$ for $1 \leq i \leq k$, a program of size $O(1)$ could calculate $c_1...c_k = \delta_{y_1z_1}...\delta_{y_kz_k}$. So combining this program with a shortest coding of $z_1...z_k$ leads to a program of size $K(z_1...z_k*) + O(1)$, which proves (49).


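Let us now assume that we make a wrong prediction in cycle $k$, i.e. $\dot c_k = 0$, $\dot y_k \neq \dot z_k$. The goal is to show that $\dot\xi_k$ defined by

$$\dot\xi_k := \xi(\dot y\underline{\dot c}_{1:k}) = \xi(\dot y\underline{\dot c}_{<k}\dot y_k\underline{0}) \leq \xi(\dot y\underline{\dot c}_{<k}) - \xi(\dot y\underline{\dot c}_{<k}\dot y_k\underline{1}) \leq \dot\xi_{k-1} - \alpha$$

decreases for every wrong prediction, at least by some $\alpha$. The first $\leq$ arises from the fact that $\xi$ is only a semimeasure.

$$\xi(\dot y\underline{\dot c}_{<k}\dot y_k\underline{1}) \geq \xi(\dot y\underline{\dot c}_{<k}(1-\dot y_k)\underline{1}) \geq 2^{-K(\delta_{\dot y_1\dot c_1}...\delta_{(1-\dot y_k)1}*) - O(1)} = 2^{-K(\dot z_1...\dot z_k*) - O(1)} \geq 2^{-K(\dot z) - O(1)} =: \alpha$$

In the first inequality we have used the fact that $\dot y_k$ maximizes the argument by definition (48), i.e. $1 - \dot y_k$ has lower probability than $\dot y_k$. (49) has been applied in the second inequality. The equality holds because $\dot z_i = \delta_{\dot y_i\dot c_i}$ and $\delta_{(1-\dot y_k)1} = \delta_{\dot y_k0} = \delta_{\dot y_k\dot c_k} = \dot z_k$. The last inequality follows from the definition of $\dot z$.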

We have shown that each erroneous prediction reduces $\dot\xi$ by at least the $\alpha$ defined above. Together with $\dot\xi_0 = 1$ and $\dot\xi_k > 0$ for all $k$, this shows that the system can make at most $1/\alpha$ errors, since otherwise $\dot\xi_k$ would become negative. So the number of wrong predictions $E^{AI\xi}_{\infty}$ of system (48) is bounded by

$$E^{AI\xi}_{\infty} < \frac{1}{\alpha} = 2^{K(\dot z) + O(1)} < \infty \qquad (50)$$

for a computable deterministic environment string $\dot z_1\dot z_2...$. The intuitive interpretation is that each wrong prediction eliminates at least one program $p$ of size $l(p) \leq K(\dot z) + O(1)$. The size is smaller than $K(\dot z)$ (up to an additive constant), as larger policies could not mislead the system to a wrong prediction, since there is a program of size $K(\dot z)$ making a correct prediction. There are at most $2^{K(\dot z) + O(1)}$ such policies, which bounds the total number of errors.
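The counting argument can be imitated with a finite hypothesis class. The sketch below is a simplified stand-in, not the AI$\xi$ model itself: it assumes a small, hand-chosen list of candidate bit sequences in place of all programs, weights of the form $2^{-(j+1)}$ in place of $2^{-l(p)}$, and a predictor that follows the larger surviving weight. Under these assumptions the weight of the true sequence plays the role of $\alpha$.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def count_errors(hypotheses: Sequence[Tuple[int, ...]],
                 weights: Sequence[Fraction],
                 true_index: int,
                 length: int) -> int:
    """Predict each bit by the larger total weight among surviving hypotheses
    and return the number of wrong predictions on the true sequence."""
    alive = set(range(len(hypotheses)))
    truth = hypotheses[true_index]
    errors = 0
    for k in range(length):
        mass = {0: Fraction(0), 1: Fraction(0)}
        for j in alive:
            mass[hypotheses[j][k]] += weights[j]
        prediction = 1 if mass[1] > mass[0] else 0
        if prediction != truth[k]:
            errors += 1
        # Hypotheses that disagree with the revealed bit are eliminated.
        alive = {j for j in alive if hypotheses[j][k] == truth[k]}
    return errors


LENGTH = 16
# Hypothetical class: hypothesis j emits 1 exactly at multiples of j + 2.
HYPOTHESES: List[Tuple[int, ...]] = [
    tuple(1 if k % (j + 2) == 0 else 0 for k in range(LENGTH)) for j in range(8)
]
WEIGHTS = [Fraction(1, 2 ** (j + 1)) for j in range(8)]

error_counts = [count_errors(HYPOTHESES, WEIGHTS, t, LENGTH) for t in range(8)]
```

Each wrong prediction removes at least the weight of the true hypothesis from the surviving mass, so under these assumptions the number of errors stays below the reciprocal of that weight, mirroring the role of $1/\alpha$ in (50).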

We have derived a finite bound for $E^{AI\xi}_{\infty}$, but unfortunately a rather weak one as compared to (22). The reason for the strong bound in the SP case was that every error at least halves $\xi$, because the sum of the $\mathrm{maxarg}_{x_k}$ arguments was 1. Here we have

$$\xi(\dot y_1\dot c_1...\dot y_{k-1}\dot c_{k-1}0\underline{0}) + \xi(\dot y_1\dot c_1...\dot y_{k-1}\dot c_{k-1}0\underline{1}) = 1$$
$$\xi(\dot y_1\dot c_1...\dot y_{k-1}\dot c_{k-1}1\underline{0}) + \xi(\dot y_1\dot c_1...\dot y_{k-1}\dot c_{k-1}1\underline{1}) = 1$$

but $\mathrm{maxarg}_{y_k}$ runs over the right top and right bottom $\xi$, for which no sum criterion holds.

The AI$\xi$ model would not be sufficient for realistic applications if the bound (50) were sharp, but we have the strong feeling (but only weak arguments) that better bounds proportional to $K(\dot z)$, analogous to (22), exist. The technique used above may not be appropriate for achieving this. One argument for a better bound is the formal similarity between $\mathrm{maxarg}_{z_k}\, \xi(\dot z_{<k}\underline{z_k})$ and (48); the other is that we were unable to construct an example sequence for which (48) makes more than $O(K(\dot z))$ errors.
6 Strategic Games (SG)

Introduction: A very important class of problems are strategic games, like chess. In fact, what is subsumed under game theory nowadays is so general that it includes a huge variety of games, from simple games of chance like roulette, through games combining chance with strategy like Backgammon, up to purely strategic games like chess, checkers or go. Game theory can also describe political and economic competitions and coalitions; even Darwinism and many more have been modeled within game theory. It seems that nearly every AI problem could be brought into the form of a game. Nevertheless, the intention of a game is that several players perform some actions with (partially) observable consequences. The goal of each player is to maximize some utility function (e.g. to win the game). The players are assumed to be rational, taking into account all information they possess. The different goals of the players are usually in conflict. For an introduction to game theory, see the literature cited in the references.

If we interpret the AI system as one player, let the environment model the other rational player, and let the environment provide the reinforcement feedback $c_k$, we see that the system-environment configuration satisfies all criteria of a game. On the other hand, we know that the AI system can handle more general situations, since it interacts optimally with an environment even if the environment is not a rational player with conflicting goals.

Strictly competitive strategic games: In the following, we restrict ourselves to deterministic, strictly competitive strategic games (see the footnote on terminology) with alternating moves. Player 1 makes move $y'_k$ in round $k$, followed by the move $x'_k$ of player 2. So a game with $n$ rounds consists of a sequence of alternating moves $y'_1x'_1y'_2x'_2...y'_nx'_n$. At the end of the game in cycle $n$ the game or final board state is evaluated with $C(y'_1x'_1...y'_nx'_n)$. Player 1 tries to maximize $C$, whereas player 2 tries to minimize $C$. In the simplest case, $C$ is 1 if player 1 won the game, $C = -1$ if player 2 won and $C = 0$ for a draw. We assume a fixed game length $n$ independent of the actual move sequence. For games with variable length but maximal possible number of moves $n$, we could add dummy moves and pad the length to $n$. The optimal strategy (Nash equilibrium) of both players is a minimax strategy

$$\dot x'_k = \mathrm{minarg}_{x'_k} \max_{y'_{k+1}} \min_{x'_{k+1}} ... \max_{y'_n} \min_{x'_n} C(\dot y'_1\dot x'_1...\dot y'_kx'_k...y'_nx'_n) \qquad (51)$$

$$\dot y'_k = \mathrm{maxarg}_{y'_k} \min_{x'_k} ... \max_{y'_n} \min_{x'_n} C(\dot y'_1\dot x'_1...\dot y'_{k-1}\dot x'_{k-1}y'_kx'_k...y'_nx'_n) \qquad (52)$$

But note that the minimax strategy is only optimal if both players behave rationally. If, for instance, player 2 has limited capabilities or makes errors and player 1 is able to discover these (through past moves), he could exploit these and improve his performance by deviating from the minimax strategy. At least, the classical game theory of Nash equilibria does not take into account limited rationality, whereas the AI$\xi$ system should.

Footnote (on terminology): In game theory, games like chess are often called 'extensive', whereas 'strategic' is reserved for a different kind of game.
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Using the AI^u model for game playing: In the following, we demonstrate the 
applicability of the AI model to games. The AI system takes the position of player 1. 
The environment provides the evaluation C. For a symmetric situation we could take a 
second AI system as player 2, but for simplicity we take the environment as the second 
player and assume that this environmental player behaves according to the minimax 
strategy (|5TD. The environment serves as a perfect player and as a teacher, albeit a very 
crude one as it tells the system at the end of the game, only whether it won or lost. 

The minimax behaviour of player 2 can be expressed by a (deterministic) probability 
distribution /i"^^ as the following 

{1 if = minarg ... maxminC'(|/^ ...xl_i?/l'...x'') Vl<A;<n 
otherwise 

(53) 

The probability that player 2 makes move x'^ is jU (?/^x']^...?/^x'^) which is 1 for 
defined in (^) and otherwise. 

Clearly, the AI system receives no feedback, i.e. $c_1 = \ldots = c_{n-1} = 0$, until the end of the
game, where it should receive positive/negative/neutral feedback on a win/loss/draw, i.e.
$c_n = C(\ldots)$. The environmental prior probability is therefore

$$\mu^{AI}(y_1x_1 \ldots y_nx_n) = \begin{cases} \mu^{SG}(y'_1x'_1 \ldots y'_nx'_n) & \text{if } c_1 = \ldots = c_{n-1} = 0 \text{ and } c_n = C(y'_1x'_1 \ldots y'_nx'_n) \\ 0 & \text{otherwise} \end{cases} \tag{54}$$

where $y_i = y'_i$ and $x_i = c_ix'_i$. If the environment is a minimax player (51) plus a crude
teacher $C$, i.e. if $\mu^{AI}$ is the true prior probability, the question now is what the
behaviour $\dot y_k^{AI}$ of the AI$\mu$ system is. It turns out that if we set $m_k = n$ the AI$\mu$ system is
also a minimax player (52) and hence optimal:

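$$\begin{aligned}
\dot y_k^{AI} &= \operatorname*{maxarg}_{y_k} \sum_{x'_k} \cdots \max_{y_n} \sum_{x'_n} C(\dot y\dot x'_{<k}\, yx'_{k:n}) \cdot \mu^{SG}(\dot y\dot x'_{<k}\, yx'_{k:n}) \\
&= \operatorname*{maxarg}_{y_k} \sum_{x'_k} \cdots \max_{y_{n-1}} \sum_{x'_{n-1}} \max_{y_n} \min_{x'_n} C(\dot y\dot x'_{<k}\, yx'_{k:n}) \cdot \mu^{SG}(\dot y\dot x'_{<k}\, yx'_{k:n-1}) \\
&= \ldots = \operatorname*{maxarg}_{y_k} \min_{x'_k} \cdots \max_{y_n} \min_{x'_n} C(\dot y\dot x'_{<k}\, yx'_{k:n}) = \dot y_k^{SG}
\end{aligned} \tag{55}$$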

In the first line we inserted $m_k = n$ and (54) into the definition of $\dot y_k^{AI}$. This
removes all sums over the $c_k$. Further, the sum over $x'_n$ gives only a contribution for
$x'_n = \operatorname*{minarg}_{x'_n} C(\dot y\dot x'_{<k}\, yx'_{k:n})$ by definition (53) of $\mu^{SG}$. Inserting this $x'_n$ gives
the second line. $\mu^{SG}$ is effectively reduced to a lower number of arguments and the sum over
$x'_n$ is replaced by $\min_{x'_n}$. Repeating this procedure for $x'_{n-1}, \ldots, x'_k$ leads to the last
line, which is just the minimax strategy of player 1 defined in (52).
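
The reduction in (55) can be checked numerically on a small game. The following Python
sketch is an illustration only, under assumptions that are not part of the text: a
hypothetical two-round game with binary moves, an arbitrary payoff table C, and function
names of our own choosing. It compares the expectation-maximizing move against a
deterministic minimax opponent with the plain minimax move of player 1.

```python
from itertools import product

N_ROUNDS = 2          # each round: player 1 plays y, then player 2 plays x
MOVES = (0, 1)

# Hypothetical payoff table C(y1, x1, y2, x2); any table would do (assumption).
C = {h: ((3 * h[0] + 5 * h[1] + 7 * h[2] + 2 * h[3] + h[0] * h[3]) % 5) - 2
     for h in product(MOVES, repeat=2 * N_ROUNDS)}


def minimax_value(history):
    """max_y min_x ... C, for a history in which player 1 moves next."""
    if len(history) == 2 * N_ROUNDS:
        return C[history]
    return max(min(minimax_value(history + (y, x)) for x in MOVES) for y in MOVES)


def opponent_move(history_with_y):
    """Minimax reply of player 2 (first minimiser on ties)."""
    return min(MOVES, key=lambda x: minimax_value(history_with_y + (x,)))


def mu_sg(history_with_y, x):
    """Deterministic 'probability' that player 2 answers with x."""
    return 1.0 if x == opponent_move(history_with_y) else 0.0


def expectimax_value(history):
    """max_y sum_x mu(x) * value: expectation maximisation up to the game end."""
    if len(history) == 2 * N_ROUNDS:
        return C[history]
    return max(sum(mu_sg(history + (y,), x) * expectimax_value(history + (y, x))
                   for x in MOVES) for y in MOVES)


def minimax_move(history):
    return max(MOVES, key=lambda y: min(minimax_value(history + (y, x)) for x in MOVES))


def aimu_move(history):
    return max(MOVES, key=lambda y: sum(
        mu_sg(history + (y,), x) * expectimax_value(history + (y, x)) for x in MOVES))
```

Because the opponent's distribution puts all its weight on the minimizing reply, every sum
over x collapses to a min, so `aimu_move` and `minimax_move` agree at every position of this
toy game.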

Let us now assume that the game under consideration is played $s$ times. The prior
probability then is

$$\mu^{AI}(y_1x_1 \ldots y_{sn}x_{sn}) = \prod_{r=0}^{s-1} \mu_1^{AI}(y_{rn+1}x_{rn+1} \ldots y_{(r+1)n}x_{(r+1)n}) \tag{56}$$

where we have renamed the prior probability (54) for one game to $\mu_1^{AI}$. (56) is a special
case of a factorizable $\mu$ with identical factors $\mu_r = \mu_1^{AI}$ for all $r$ and equal episode
lengths $n_{r+1} - n_r = n$. The AI$\mu$ system for repeated game playing also implements
the minimax strategy,

$$\dot y_k^{AI} = \operatorname*{maxarg}_{y_k} \min_{x'_k} \cdots \max_{y_{(r+1)n}} \min_{x'_{(r+1)n}} C(\dot y\dot x'_{rn+1:k-1}\, yx'_{k:(r+1)n}) \tag{57}$$

with $r$ such that $rn < k \le (r+1)n$ and for any choice of $m_k$ as long as the horizon $h_k \ge n$.
This can be proved by using the factorization property together with (55). See the earlier
discussion on separable and factorizable $\mu$.



Games of variable length: In the unrepeated case we have argued that games of
variable but bounded length can be padded to a fixed length without effect. We now
analyze, for a sequence of games, the effect of replacing games of fixed length by
games of variable length. The sequence $y'_1x'_1 \ldots y'_nx'_n$ can still be grouped into episodes
corresponding to the moves of separate consecutive games, but now the length and
total number of games that fit into the $n$ moves depend on the actual moves taken$^{17}$.
$C(y'_1x'_1 \ldots y'_nx'_n)$ equals the number of games where the system wins, minus the number of
games where the environment wins. Whenever a loss, win or draw has been achieved by
the system or the environment, a new game starts. The player whose turn it would be next
begins the next game. The games are still separated in the sense that the behaviour
and credit of the current game do not influence the next game. On the other hand,
they are slightly entangled, because the length of the current game determines the starting
time of the next. As the rules of the game are time invariant, this does not influence
the next game directly. If we play a fixed number of games, the games are completely
independent, but if we play a fixed number of total moves $n$, the number of games depends
on their lengths. This has the following consequences: the better player tries to keep the
games short, to win more games in the given time $n$. The poorer player tries to draw
the games out, in order to lose fewer games. The better player might further prefer a quick
draw to winning a long game. Formally, this entanglement is represented by the
fact that the prior probability $\mu$ no longer factorizes. The reduced form (57) of $\dot y_k^{AI}$
to one episode is no longer valid. Also, the behaviour $\dot y_k^{AI}$ of the system depends on $m_k$,
even if the horizon is chosen larger than the longest possible game (unless $m_k \ge n$).
The important point is that the system realizes that keeping games short/long can lead to
increased credit. In practice, a horizon much larger than the average game length should
be sufficient to incorporate this effect. The details of games in the distant future do not
affect the current game and can, therefore, be ignored. A more quantitative analysis could
be interesting, but would lead us too far astray.



$^{17}$If the sum of game lengths does not fit exactly into $n$ moves, we pad the last game appropriately.

Using the AI$\xi$ model for game playing: When going from the specific AI$\mu$ model,
where the rules of the game have been explicitly modeled into the prior probability $\mu^{AI}$,
to the universal model AI$\xi$, we have to ask whether these rules can be learned from the
assigned credits $c_k$. Here, another (actually the main) reason for studying the case of
repeated games, rather than just one game, arises. For a single game there is only one
cycle of non-trivial feedback, namely the end of the game, which is too late to be useful
except when there are further games following.

Even in the case of repeated games, there is only very limited feedback, at most $\log_2 3$ bits
of information per game if the 3 outcomes win/loss/draw have the same frequency. So
at least $O(K(game))$ games are necessary to learn a game of complexity
$K(game)$. Apart from extremely simple games, even this estimate is far too optimistic.
As the AI$\xi$ system has no information about the game to begin with, its moves will be
more or less random and it can win the first few games only by pure luck. So the
probability that the system loses is near to one and hence the information content $I$ in
the feedback $c_k$ at the end of the game is much less than $\log_2 3$. This situation remains
for a very large number of games. On the other hand, in principle, every game should be
learnable after a very long sequence of games even with this minimal feedback only, as
long as $I \ne 0$.
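
The size of this effect is easy to quantify with the Shannon entropy of the outcome
distribution. The sketch below is a minimal illustration; the function name and the
"novice" outcome frequencies are hypothetical numbers chosen for the example, not values
from the text.

```python
import math


def outcome_information(p_win, p_loss, p_draw):
    """Shannon entropy in bits of a win/loss/draw outcome distribution."""
    probs = (p_win, p_loss, p_draw)
    if min(probs) < 0 or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Equal frequencies give the maximum of log2(3) bits per game.
uniform = outcome_information(1 / 3, 1 / 3, 1 / 3)

# Hypothetical beginner that loses almost every game (assumed frequencies).
novice = outcome_information(0.001, 0.998, 0.001)
```

With these assumed frequencies the feedback carries only a few hundredths of a bit per
game, far below the $\log_2 3$ upper bound, which is why the number of games needed can be
much larger than the complexity of the game suggests.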

The important point is that no other learning scheme with no extra information can learn
the game more quickly. We expect this to be true as $\mu^{AI}$ factorizes in the case of games of
fixed length, i.e. $\mu^{AI}$ satisfies a strong separability condition. In the case of variable game
length the entanglement is also low. $\mu^{AI}$ should still be sufficiently separable to allow
good credit bounds for AI$\xi$ to be formulated and proved.

To learn realistic games like tic-tac-toe (noughts and crosses) in realistic time one has to
provide more feedback. This could be achieved by intermediate help during the game. The
environment could give positive (negative) feedback for every good (bad) move the system
makes. The threshold for judging a move as good should be adapted to the
experience gained by the system in such a way that approximately half of the moves are
valued as good and the other half as bad, in order to maximize the information content
of the feedback.

For more complicated games like chess, even more feedback is necessary from a practical
point of view. One way to increase the feedback far beyond a few bits per cycle is to train
the system by teaching it good moves. This is called supervised learning. Despite the fact
that the AI model has only a credit feedback $c_k$, it is able to learn by teaching, as will be
shown in section 8. Another way would be to start with simpler games containing
certain aspects of the true game and to switch to the true game when the system has
learned the simple game.

No other difficulties are expected when going from $\mu$ to $\xi$. Eventually $\xi^{AI}$ will converge
to the minimax strategy $\mu^{AI}$. In the more realistic case, where the environment is not a
perfect minimax player, AI$\xi$ can detect and exploit the weakness of the opponent.

Finally, we want to comment on the input/output space $X/Y$ of the AI system. In
practical applications, $Y$ will possibly also include illegal moves. If $Y$ is the set of moves
of e.g. a robotic arm, the system could move a wrong figure or even knock over the figures.
A simple way to handle illegal moves $y_k$ is by interpreting them as losing moves, which
terminate the game. Further, if e.g. the input $x_k$ is the image of a video camera which
makes one shot per move, $X$ is not the set of moves by the environment but includes the
set of states of the game board. The discussion in this section handles this case as well.
There is no need to explicitly design the system's I/O space $X/Y$ for a specific game.

The discussion above on the AI$\xi$ system was rather informal for the following reason:
game playing (the SG$\xi$ system) has (nearly) the same complexity as fully general AI, and
quantitative results for the AI$\xi$ system are difficult (but not impossible) to obtain.

7 Function Minimization (FM)

Applications/Examples: There are many problems that can be reduced to a minimization
problem (FM). The minimum of a (real valued) function $f: Y \to \mathbb{R}$ over some
domain $Y$, or a good approximation of it, has to be found, usually with some limited
resources.

One popular example is the traveling salesman problem (TSP). $Y$ is the set of different
routes between towns and $f(y)$ the length of route $y \in Y$. The task is to find a route of
minimal length visiting all cities. This problem is NP-hard. Getting good approximations
in limited time is of great importance in various applications. Another example is the
minimization of production costs (MPC), e.g. of a car, under several constraints. $Y$
is the set of all alternative car designs and production methods compatible with the
specifications and $f(y)$ the overall cost of alternative $y \in Y$. A related example is finding
materials or (bio)molecules with certain properties (MAT), e.g. solids with minimal
electrical resistance, maximally efficient chlorophyll modifications, or aromatic molecules
that taste as close as possible to strawberry. We can also ask for nice paintings (NPT). $Y$
is the set of all existing or imaginable paintings and $f(y)$ characterizes how much person
A likes painting $y$. The system should present paintings which A likes.

For now, these are enough examples. The TSP is very rigorous from a mathematical
point of view, as $f$, i.e. an algorithm for $f$, is usually known. In principle, the minimum
could be found by exhaustive search, were it not for computational resource limitations.
For MPC, $f$ can often be modeled in a reliable and sufficiently accurate way. For MAT
you need very accurate physical models, which might be unavailable or too difficult to
solve or implement. For NPT the most we have is the judgement of person A on every
presented painting. The evaluation function $f$ cannot be implemented without scanning
A's brain, which is not possible with today's technology.

So there are different limitations, some depending on the application we have in mind.
An implementation of $f$ might not be available, $f$ can only be tested at some arguments $y$,
and $f(y)$ is determined by the environment. We want to (approximately) minimize $f$ with
as few function calls as possible or, conversely, find the closest possible approximation
to the minimum within a fixed number of function evaluations. If $f$ is available or can
quickly be inferred by the system and evaluation is quick, it is more important to minimize
the total time needed to imagine new trial minimum candidates plus the evaluation time
for $f$. As we do not consider computational aspects of AI$\xi$ until section 10, we concentrate
on the first case, where $f$ is not available or dominates the computational requirements.

The Greedy Model FMG$\mu$: The FM model consists of a sequence $y_1z_1y_2z_2\ldots$ where
$y_k$ is a trial of the FM system for a minimum of $f$ and $z_k = f(y_k)$ is the true function
value returned by the environment. We randomize the model by assuming a probability
distribution $\mu(f)$ over the functions. There are several reasons for doing this. We might
really not know the exact function $f$, as in the NPT example, and model our uncertainty
by the probability distribution $\mu$. More importantly, we want to parallel the other AI
classes, like in the SP$\mu$ model, where we always started with a probability distribution $\mu$
that was finally replaced by $\xi$ to get the universal Solomonoff prediction SP$\xi$. We want to
do the same thing here. Further, the probabilistic case includes the deterministic case by
choosing $\mu(f) = \delta_{ff_0}$, where $f_0$ is the true function. A final reason is that the deterministic
case is trivial when $\mu$ and hence $f_0$ is known, as the system can internally (virtually) check
all function arguments and output the correct minimum from the very beginning.

We will assume that $Y$ is countable or finite and that $\mu$ is a discrete measure, e.g. by
taking only computable functions. The probability that the function values of $y_1, \ldots, y_n$
are $z_1, \ldots, z_n$ is then given by

$$\mu^{FM}(y_1z_1 \ldots y_nz_n) := \sum_{f:\, f(y_i)=z_i\ \forall 1 \le i \le n} \mu(f) \tag{58}$$

We start with a model that minimizes the expectation of the function value $z_k$ for the
next output $y_k$, taking into account previous information:

$$\dot y_k := \operatorname*{minarg}_{y_k} \sum_{z_k} z_k \cdot \mu(\dot y_1\dot z_1 \ldots \dot y_{k-1}\dot z_{k-1}\, y_kz_k)$$

This type of greedy algorithm, just minimizing the next feedback, was sufficient for
sequence prediction (SP) and is also sufficient for classification (CF). It is, however, not
sufficient for function minimization as the following example demonstrates.

Take $f: \{0,1\} \to \{1,2,3,4\}$. There are 16 different functions which shall be equiprobable,
$\mu(f) = \frac{1}{16}$. The function expectation in the first cycle

$$\langle z_1 \rangle := \sum_{z_1} z_1 \cdot \mu(y_1z_1) = \frac14 \sum_{z_1} z_1 = \frac14(1+2+3+4) = 2.5$$

is just the arithmetic average of the possible function values and is independent of $y_1$.
Therefore, $y_1 = 0$, as minarg is defined to take the lexicographically first minimum in an
ambiguous case. Let us assume that $f_0(0) = 2$, where $f_0$ is the true environment function,
i.e. $z_1 = 2$. The expectation of $z_2$ is then

$$\langle z_2 \rangle := \sum_{z_2} z_2 \cdot \mu(0\,2\,y_2z_2) = \begin{cases} 2 & \text{for } y_2 = 0 \\ 2.5 & \text{for } y_2 = 1 \end{cases}$$

For $y_2 = 0$ the system already knows $f(0) = 2$; for $y_2 = 1$ the expectation is, again, the
arithmetic average. The system will again output $y_2 = 0$ with feedback $z_2 = 2$. This will
continue forever. The system is not motivated to explore other $y$'s as $f(0)$ is already
smaller than the expectation of $f(1)$. This is obviously not what we want. The greedy
model fails. The system ought to be inventive and try other outputs when given enough
time.
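
The failure can be reproduced directly. The following Python sketch simulates the greedy
rule on the example above by enumerating all 16 functions; the helper names and the
particular true function `f0` (with its minimum at y = 1) are our own illustrative choices.

```python
from itertools import product

Y = (0, 1)
Z = (1, 2, 3, 4)
# All 16 functions f: {0,1} -> {1,2,3,4}; each has prior weight 1/16.
FUNCTIONS = [dict(zip(Y, values)) for values in product(Z, repeat=len(Y))]


def consistent(history):
    """Functions that agree with every observed (y, z) pair."""
    return [f for f in FUNCTIONS if all(f[y] == z for y, z in history)]


def expected_value(history, y):
    """Posterior expectation of f(y) under the uniform prior."""
    fs = consistent(history)
    return sum(f[y] for f in fs) / len(fs)


def greedy_choice(history):
    """minarg over y of the expected next value; ties go to the smallest y."""
    return min(Y, key=lambda y: expected_value(history, y))


def run_greedy(f0, cycles):
    """Play `cycles` rounds of the greedy model against the true function f0."""
    history = []
    for _ in range(cycles):
        y = greedy_choice(history)
        history.append((y, f0[y]))
    return history


# Hypothetical true function with f0(0) = 2; the actual minimum sits at y = 1.
f0 = {0: 2, 1: 1}
trace = run_greedy(f0, cycles=5)
```

In this run the trace consists of the pair (0, 2) in every cycle: the greedy system never
tests y = 1, although the minimum lies there.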

The general reason for the failure of the greedy approach is that the information contained
in the feedback $z_k$ depends on the output $y_k$. An FM system can actively influence the
knowledge it receives from the environment by the choice of $y_k$. It may be more
advantageous to first collect certain knowledge about $f$ by an (in the greedy sense) non-optimal
choice for $y_k$, rather than to minimize the $z_k$ expectation immediately. The non-minimality of
$z_k$ might be over-compensated in the long run by exploiting this knowledge. In SP, the
received information is always the current bit of the sequence, independent of what SP
predicts for this bit. This is the reason why a greedy strategy in the SP case is already
optimal.

The general FM$\mu/\xi$ Model: To get a useful model we have to think more carefully
about what we really want. Should the FM system output a good minimum in the last
output in a limited number of cycles $T$, or should the average of the $z_1, \ldots, z_T$ values be
minimal, or does it suffice that just one of the $z$ is as small as possible? Let us define the
FM$\mu$ model as to minimize the $\mu$ averaged weighted sum $\alpha_1z_1 + \ldots + \alpha_Tz_T$ for some given
$\alpha_k \ge 0$. Building the $\mu$ average by summation over the $z_i$ and minimizing w.r.t. the $y_i$ has
to be performed in the correct chronological order. With similar reasoning as for the AI$\mu$
model we get

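$$\dot y_k^{FM} = \operatorname*{minarg}_{y_k} \sum_{z_k} \cdots \min_{y_T} \sum_{z_T} (\alpha_1z_1 + \ldots + \alpha_Tz_T) \cdot \mu(\dot y_1\dot z_1 \ldots \dot y_{k-1}\dot z_{k-1}\, y_kz_k \ldots y_Tz_T) \tag{59}$$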

If we want the final output $\dot y_T$ to be optimal we should choose $\alpha_k = 0$ for $k < T$ and
$\alpha_T = 1$ (final model FMF$\mu$). If we want to already have a good approximation during
intermediate cycles, we should demand that the outputs of all cycles together are optimal
in some average sense, so we should choose $\alpha_k = 1$ for all $k$ (sum model FMS$\mu$). If
we want to have something in between, for instance, increase the pressure to produce
good outputs, we could choose the $\alpha_k = e^{\gamma(k-T)}$ exponentially increasing for some $\gamma > 0$
(exponential model FME$\mu$). For $\gamma \to \infty$ we get the FMF$\mu$, for $\gamma \to 0$ the FMS$\mu$ model. If
we want to demand that the best of the outputs $y_1 \ldots y_T$ is optimal, we must replace the $\alpha$
weighted $z$-sum by $\min\{z_1, \ldots, z_T\}$ (minimum model FMM$\mu$). We expect the behaviour
to be very similar to the FMF$\mu$ model, and do not consider it further.
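
To see how these weight choices change the behaviour, the min/sum alternation of (59) can
be evaluated exactly on the small example with 16 equiprobable functions
$f: \{0,1\} \to \{1,2,3,4\}$. The sketch below is illustrative only: the function names are ours,
and it works with posterior probabilities and the remaining weighted sum, which leaves the
minimizing output unchanged because past terms and the normalization are constants.

```python
import math
from itertools import product

Y = (0, 1)
Z = (1, 2, 3, 4)
FUNCTIONS = [dict(zip(Y, values)) for values in product(Z, repeat=len(Y))]  # uniform prior


def weights(model, T, gamma=1.0):
    """alpha_1..alpha_T for the final (FMF), sum (FMS) and exponential (FME) models."""
    if model == "FMF":
        return [0.0] * (T - 1) + [1.0]
    if model == "FMS":
        return [1.0] * T
    if model == "FME":
        return [math.exp(gamma * (k - T)) for k in range(1, T + 1)]
    raise ValueError(model)


def plan(history, alpha):
    """Return (expected remaining weighted sum, best next y) by exact recursion."""
    k = len(history)                      # number of cycles already played
    if k == len(alpha):
        return 0.0, None
    fs = [f for f in FUNCTIONS if all(f[y] == z for y, z in history)]
    best = None
    for y in Y:
        total = 0.0
        for z in Z:
            p = sum(1 for f in fs if f[y] == z) / len(fs)   # posterior of f(y) = z
            if p > 0:
                future, _ = plan(history + [(y, z)], alpha)
                total += p * (alpha[k] * z + future)
        if best is None or total < best[0]:                 # ties keep the smaller y
            best = (total, y)
    return best


# After observing f(0) = 2 in the first cycle:
fmf_last_cycle = plan([(0, 2)], weights("FMF", 2))   # no cycle left to use new knowledge
fmf_one_spare = plan([(0, 2)], weights("FMF", 3))    # one spare cycle before the final output
fms_short = plan([(0, 2)], weights("FMS", 3))
fms_long = plan([(0, 2)], weights("FMS", 5))
```

In this toy setting the final model tests y = 1 as soon as one spare cycle is available,
whereas the sum model only does so when enough cycles remain to recoup the expected cost of
the test.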

By construction, the FM$\mu$ models guarantee optimal results in the usual sense that no
other model knowing only $\mu$ can be expected to produce better results. The variety of
FM variants is not a fault of the theory. They just reflect the fact that there is some
interpretational freedom of what is meant by minimization within $T$ function calls. In
most applications, probably FMF is appropriate. In the NPT application one might prefer
the FMS model.

The interesting case (in AI) is when $\mu$ is unknown. We define for this case the FM$\xi$ model
by replacing $\mu(f)$ with some $\xi(f)$, which should assign high probability to functions $f$ of
low complexity. So we might define $\xi(f) = \sum_{q:\, \forall x[U(qx)=f(x)]} 2^{-l(q)}$. The problem with this
definition is that it is, in general, undecidable whether a TM $q$ is an implementation of
a function $f$. $\xi(f)$ defined in this way is uncomputable, not even approximable. As we
only need a $\xi$ analog to the l.h.s. of (58), the following definition is natural:

$$\xi^{FM}(y_1z_1 \ldots y_nz_n) := \sum_{q:\, q(y_i)=z_i\ \forall 1 \le i \le n} 2^{-l(q)} \tag{60}$$

$\xi^{FM}$ is actually equivalent to inserting the incomputable $\xi(f)$ into (58). $\xi^{FM}$ is an
enumerable semi-measure and universal, relative to all probability distributions of the form
(58). We will not prove this here.
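
The structure of (60), a sum of weights $2^{-l(q)}$ over all programs consistent with the
observed pairs, can be illustrated with a finite stand-in. The sketch below is a toy under
loud assumptions: the four "programs" and their description lengths are invented for the
example, whereas the real sum ranges over all programs of a universal machine and cannot be
computed this way.

```python
# Hypothetical 'programs': (assumed description length in bits, function).
PROGRAMS = [
    (2, lambda y: 0),        # constant zero
    (3, lambda y: y),        # identity
    (3, lambda y: y % 2),    # parity
    (4, lambda y: y * y),    # square
]


def xi_fm(pairs):
    """Sum of 2^-l(q) over toy programs q with q(y_i) = z_i for every pair."""
    return sum(2.0 ** -length for length, q in PROGRAMS
               if all(q(y) == z for y, z in pairs))


prior_mass = xi_fm([])                      # all toy programs are consistent
after_one = xi_fm([(1, 1)])                 # identity, parity and square remain
after_two = xi_fm([(1, 1), (2, 0)])         # only parity remains
reordered = xi_fm([(2, 0), (1, 1)])         # same value: the order does not matter
contradictory = xi_fm([(1, 1), (1, 2)])     # no function maps 1 to both 1 and 2
```

Even in this toy version two properties of the definition are visible: the value does not
depend on the order of the pairs, and it vanishes when the same argument is paired with two
different values.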

Alternatively, we could have constrained the sum in (60) by $q(y_1 \ldots y_n) = z_1 \ldots z_n$, by analogy
with the earlier definition, but these two definitions are not equivalent. Definition (60)
ensures the symmetry$^{19}$ in its arguments and $\xi^{FM}(\ldots yz \ldots yz' \ldots) = 0$ for $z \ne z'$. It
incorporates all general knowledge we have about function minimization, whereas the
alternative does not. But this extra knowledge has only low information content (complexity of
$O(1)$), so we do not expect FM$\xi$ to perform much worse when using the alternative instead of
(60). But there is no reason to deviate from (60) at this point.

$^{18}$$\xi^{FM}$ is a true probability distribution if we include partial functions in the domain. So
normalization is not necessary.

$^{19}$See py] for a discussion on symmetric universal distributions on unordered data.

We can now define an "error" measure $E_T^{FM\mu}$ as (59) with $k = 1$ and $\operatorname*{minarg}_{y_1}$ replaced by
$\min_{y_1}$ and, additionally, $\mu$ replaced by $\xi$ for $E_T^{FM\xi}$. We expect $|E_T^{FM\xi} - E_T^{FM\mu}|$ to be bounded
in a way that justifies the use of $\xi$ instead of $\mu$ for computable $\mu$, i.e. computable $f_0$ in
the deterministic case. The arguments are the same as for the AI$\xi$ model.



Is the general model inventive? In the following we will show that FM$\xi$ will never
cease searching for minima, but will test an infinite set of different $y$'s for $T \to \infty$.

Let us assume that the system tests only a finite number of $y_i \in A \subset Y$, $|A| < \infty$. Let $t-1$
be the cycle in which the last new $y \in A$ is selected (or some later cycle). When $y$'s are selected
a second time in cycles $k \ge t$, the feedback $z$ does not provide any new information, i.e. does
not modify the probability $\xi^{FM}$. The system can minimize $E_T^{FM\xi}$ by outputting in cycles
$k \ge t$ the best $y \in A$ found so far (in the case $\alpha_k = 0$, the output does not matter). Let us
fix $f$ for a moment. Then we have

$$E^a := \alpha_1z_1 + \ldots + \alpha_Tz_T = \sum_{k=1}^{t-1} \alpha_k f(y_k) + f_1 \cdot \sum_{k=t}^{T} \alpha_k, \qquad f_1 := \min_{1 \le k < t} f(y_k)$$

Let us now assume that the system tests one additional $y_t \notin A$ in cycle $t$, but no other
$y \notin A$. Again, it will keep to the best output for $k > t$, which is either the one of the
previous system or $y_t$:

$$E^b = \sum_{k=1}^{t} \alpha_k f(y_k) + \min\{f_1, f(y_t)\} \cdot \sum_{k=t+1}^{T} \alpha_k$$

The difference can be represented in the form

$$E^a - E^b = \Big(\sum_{k=t}^{T} \alpha_k\Big) f^+ - \alpha_t f^-, \qquad f^\pm := \max\{0, \pm(f_1 - f(y_t))\} \ge 0$$

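As the true FM$\xi$ strategy is the one which minimizes $E$, assumption $a$ is ruled out if
$E^a > E^b$. We say that $b$ is favored over $a$, which does not mean that $b$ is the correct
strategy, only that $a$ is not the true one. For probability distributed $f$, $b$ is favored over
$a$ when

$$E^a - E^b = \Big(\sum_{k=t}^{T} \alpha_k\Big) \langle f^+ \rangle - \alpha_t \langle f^- \rangle > 0 \quad\Longleftrightarrow\quad \sum_{k=t}^{T} \alpha_k > \alpha_t \frac{\langle f^- \rangle}{\langle f^+ \rangle}$$

where $\langle f^\pm \rangle$ is the $\xi$ expectation of $\pm f_1 \mp f(y_t)$ under the condition that $\pm f_1 \ge \pm f(y_t)$ and
under the constraints imposed in cycles $1 \ldots t-1$. As $\xi$ assigns a strictly positive probability
to every non-empty event, $\langle f^+ \rangle \ne 0$. Inserting $\alpha_k = e^{\gamma(k-T)}$, assumption $a$ is ruled out in
model FME$\xi$ if

$$T - t > \frac{1}{\gamma} \ln\Big[1 + \frac{\langle f^- \rangle}{\langle f^+ \rangle}(e^\gamma - 1)\Big] - 1 \;\longrightarrow\; \begin{cases} 0 & \text{for } \gamma \to \infty \text{ (FMF}\xi\text{ model)} \\ \frac{\langle f^- \rangle}{\langle f^+ \rangle} - 1 & \text{for } \gamma \to 0 \text{ (FMS}\xi\text{ model)} \end{cases}$$
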

We see that if the condition is not satisfied for some $t$, it will remain wrong for all $t' > t$.
So the FMF$\xi$ system will test each $y$ only once up to a point from which on it always
outputs the best found $y$. Further, for $T \to \infty$ the condition always gets satisfied. As this
is true for any finite $A$, the assumption of a finite $A$ is wrong. For $T \to \infty$ the system tests
an increasing number of different $y$'s, provided $Y$ is infinite. The FMF$\xi$ model will never
repeat any $y$ except in the last cycle $T$ where it chooses the best found $y$. The FMS$\xi$
model will test a new $y_t$ for fixed $T$ only if the expected value of $f(y_t)$ is not too large.
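
The closed-form condition can be checked against the direct inequality on the weight sums.
The Python sketch below assumes exponential weights $\alpha_k = e^{\gamma(k-T)}$ and treats the ratio
$\langle f^- \rangle / \langle f^+ \rangle$ as a given number; the function names and the sample values
$\gamma = 0.5$ and ratio 3.7 are arbitrary choices for illustration.

```python
import math


def alpha(k, T, gamma):
    """Exponential weight of cycle k (assumed form e^(gamma*(k-T)))."""
    return math.exp(gamma * (k - T))


def exploration_favoured(t, T, gamma, ratio):
    """Direct test: sum_{k=t}^T alpha_k > alpha_t * ratio, with ratio = <f->/<f+>."""
    return sum(alpha(k, T, gamma) for k in range(t, T + 1)) > alpha(t, T, gamma) * ratio


def threshold(gamma, ratio):
    """Closed-form bound: exploration is favoured iff T - t exceeds this value."""
    return math.log(1.0 + ratio * (math.exp(gamma) - 1.0)) / gamma - 1.0


# Compare both formulations for every cycle t of a run with T = 20.
agreement = all(
    exploration_favoured(t, 20, 0.5, 3.7) == ((20 - t) > threshold(0.5, 3.7))
    for t in range(1, 21)
)

near_final_model = threshold(50.0, 3.7)   # large gamma: bound approaches 0
near_sum_model = threshold(1e-6, 3.7)     # small gamma: bound approaches ratio - 1
```

Since the bound depends on $t$ only through $T - t$, once too few cycles remain for a test to
pay off, the same holds for every later cycle, which is the monotonicity used in the
argument above.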

The above does not necessarily hold for different choices of $\alpha_k$. The above also holds for
the FMF$\mu$ system if $\langle f^+ \rangle \ne 0$. $\langle f^+ \rangle = 0$ if the system can already exclude that $y_t$ is a
better guess, so there is no reason to test it explicitly.

Nothing has been said about the quality of the guesses, but for the FM$\mu$ system they are
optimal by definition. If $K(\mu)$ for the true distribution $\mu$ is finite, we expect the FM$\xi$
system to solve the "exploration versus exploitation" problem in a universally optimal
way, as $\xi$ converges to $\mu$.



Using the AI models for Function Minimization: The AI model can be used
for function minimization in the following way. The output $y_k$ of cycle $k$ is a guess for
a minimum of $f$, like in the FM model. The credit $c_k$ should be high for small function
values $z_k = f(y_k)$. The credit should also be weighted with $\alpha_k$ to reflect the same strategy
as in the FM case. The choice of $c_k = -\alpha_kz_k$ is natural. Here, the feedback is not binary
but $c_k \in C \subset \mathbb{R}$, with $C$ being a countable subset of $\mathbb{R}$, e.g. the computable reals or
all rational numbers. The feedback $x'_k$ should be the function value $f(y_k)$. So we set
$x'_k = z_k$. Note that there is a redundancy if $\alpha_k$, as a function of $k$, is computable with no
zeros, as $c_k = -\alpha_kx'_k$. So, for small $K(\alpha)$ like in the FMS model, one might set $x'_k = \epsilon$. If we
keep $x'_k$ the AI prior probability is

$$\mu^{AI}(y_1x_1 \ldots y_nx_n) = \begin{cases} \mu^{FM}(y_1z_1 \ldots y_nz_n) & \text{for } c_k = -\alpha_kz_k,\ x'_k = z_k,\ x_k = c_kx'_k \\ 0 & \text{else} \end{cases} \tag{61}$$

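Inserting this into the definition of $\dot y_k^{AI}$ with $m_k = T$ we get

$$\begin{aligned}
\dot y_k^{AI} &= \operatorname*{maxarg}_{y_k} \sum_{x_k} \cdots \max_{y_T} \sum_{x_T} (c_k + \ldots + c_T) \cdot \mu^{AI}(\dot y_1\dot x_1 \ldots y_kx_k \ldots y_Tx_T) \\
&= \operatorname*{minarg}_{y_k} \sum_{z_k} \cdots \min_{y_T} \sum_{z_T} (\alpha_kz_k + \ldots + \alpha_Tz_T) \cdot \mu^{FM}(\dot y_1\dot z_1 \ldots y_kz_k \ldots y_Tz_T) = \dot y_k^{FM}
\end{aligned}$$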

where $\dot y_k^{FM}$ has been defined in (59). The proof of equivalence was so simple because the
FM model already has a rather general structure, which is similar to the full AI model.

One might expect no problems when going from the already very general FM$\xi$ model to
the universal AI$\xi$ model (with $m_k = T$), but there is a pitfall in the case of the FMF
model. All credits $c_k$ are zero in this case, except for the last one, $c_T$. Although
there is a feedback $z_k$ in every cycle, the AI$\xi$ system cannot learn from this feedback as
it is not told that in the final cycle $c_T$ will equal $-z_T$. There is no problem in the FM$\xi$
model because in this case this knowledge is hardcoded into $\xi^{FM}$. The AI$\xi$ model must
first learn that it has to minimize a function, but it can only learn if there is a non-trivial
credit assignment $c_k$. FMF works for repeated minimization of (different) functions, such
as minimizing $N$ functions in $N \cdot T$ cycles. In this case there are $N$ non-trivial feedbacks
and AI$\xi$ has time to learn that there is a relation between $c_{kT}$ and $x'_{kT}$ every $T$-th cycle.
This situation is similar to the case of strategic games discussed in section 6.

There is no problem in applying AI$\xi$ to FMS because the $c$ feedback provides enough
information in this case. The only thing the AI$\xi$ model has to learn is to ignore the $x'$
feedbacks, as all information is already contained in $c$. Interestingly, the same argument
holds for the FME model if $K(\gamma)$ and $K(T)$ are small$^{20}$. The AI$\xi$ model additionally
only has to learn the relation $c_k = -e^{\gamma(k-T)}x'_k$. This task is simple as every cycle provides
one data point for a simple function to learn. This argument is no longer valid for $\gamma \to \infty$
as $K(\gamma) \to \infty$ in this case.

Remark: TSP seems to be trivial in the AI$\mu$ model but non-trivial in the AI$\xi$ model.
The reason is that (59) just implements an internal complete search, as $\mu(f) = \delta_{ff^{TSP}}$
contains all necessary information. AI$\mu$ outputs, from the very beginning, the exact
minimum of $f^{TSP}$. This "solution" is, of course, unacceptable from a performance perspective.
As long as we give no efficient approximation of $\xi$, we have not contributed anything
to a solution of the TSP by using a computable variant of AI$\xi$. The same is true for any other
problem where $f$ is computable and easily accessible. Therefore, TSP is not (yet) a good example
because all we have done is to replace an NP-complete problem with the uncomputable AI$\xi$ model
or with a computable variant of it, for which we have said nothing about computation time
yet. It is simply overkill to reduce simple problems to AI$\xi$. TSP is a simple problem in
this respect, until we consider the computable variant seriously. For the other examples, where $f$
is inaccessible or complicated, the computable variant provides a true solution to the minimization
problem, as an explicit definition of $f$ is not needed for AI$\xi$ or its computable variant.



$^{20}$If we set $\alpha_k = e^{\gamma k}$ the condition on $K(T)$ can be dropped.

8 Supervised Learning by Examples (EX)

The AI models provide a frame for reinforcement learning. The environment provides a
feedback $c$, informing the system about the quality of its last output $y$; it assigns credit
$c$ to output $y$. In this sense, reinforcement learning is explicitly integrated into the AI$\rho$
model. For $\rho = \mu$ it maximizes the true expected credit, whereas the AI$\xi$ model is a
universal, environment independent, reinforcement learning algorithm.

There is another type of learning method: supervised learning by presentation of examples
(EX). Many problems learned by this method are association problems of the following
type. Given some examples $x \in R \subset X$, the system should reconstruct, from a partially
given $x'$, the missing or corrupted parts, i.e. complete $x'$ to $x$ such that relation $R$ contains
$x$. In many cases, $X$ consists of pairs $(z,v)$, where $v$ is the possibly missing part.

Applications/Examples: Learning functions by presenting $(z, f(z))$ pairs and asking
for the function value of $z$ by presenting $(z, ?)$ also falls into this category.

A basic example is learning properties of geometrical objects coded in some way. E.g. if
there are 18 different objects characterized by their size (small or big), their colors (red,
green or blue) and their shapes (square, triangle, circle), then (object, property) $\in R$ if the
object possesses the property. Here, $R$ is a relation which is not the graph of a
single-valued function.

When teaching a child, by pointing to objects and saying "this is a tree" or "look how
green" or "how beautiful", one establishes a relation of (object, property) pairs in $R$.
Pointing to a (possibly different) tree later and asking "what is this?" corresponds to a
partially given pair (object, ?), where the missing part "?" should be completed by the
child saying "tree".

A final example we want to give is chess. We have seen that, in principle, chess can be
learned by reinforcement learning. In the extreme case the environment only provides
credit $c = 1$ when the system wins. The learning rate is completely unacceptable from
a practical point of view. The reason is the very low amount of information feedback.
A more practical method of teaching chess is to present example games in the form of
sensible (board-state, move) sequences. They contain information about legal and good
moves (but without any explanation). After several games have been presented, the
teacher could ask the system to make its own move by presenting (board-state, ?) and
then evaluate the answer of the system.

Supervised learning with the AI$\mu/\xi$ model: Let us define the EX model as follows:
The environment presents inputs $x'_k = z_kv_k \equiv (z_k, v_k) \in R \cup (Z \times \{?\}) \subset Z \times (Y \cup \{?\}) = X'$
to the system in cycle $k$. The system is expected to output $y_{k+1}$ in the next cycle, which
is evaluated with $c_{k+1} = 1$ if $(z_k, y_{k+1}) \in R$ and 0 otherwise. To simplify the discussion,
an output $y_k$ is expected and evaluated even when $v_k\, (\ne ?)$ is given. To complete the
description of the environment, the probability distribution $\mu_R(x'_1 \ldots x'_n)$ of the examples
(depending on $R$) has to be given. Wrong examples should not occur, i.e. $\mu_R$ should
be 0 if $x'_i \notin R \cup (Z \times \{?\})$ for some $1 \le i \le n$. The relations $R$ might also be probability
distributed with $\sigma(R)$. The example prior probability in this case is

$$\mu(x'_1 \ldots x'_n) = \sum_R \mu_R(x'_1 \ldots x'_n) \cdot \sigma(R) \tag{62}$$

The knowledge of the valuation $c_k$ on output $y_k$ restricts the possible relations $R$ to those
consistent with $R(z_k, y_{k+1}) = c_{k+1}$, where $R(z,y) := 1$ if $(z,y) \in R$ and 0 otherwise. The prior
probability for the input sequence $x_1 \ldots x_n$ if the output sequence is $y_1 \ldots y_n$ is therefore

$$\mu^{AI}(y_1x_1 \ldots y_nx_n) = \sum_{R:\, \forall 1 \le i < n\, [R(z_i, y_{i+1}) = c_{i+1}]} \mu_R(x'_1 \ldots x'_n) \cdot \sigma(R)$$

where $x_i = c_ix'_i$ and $x'_{i-1} = z_iv_i$ with $v_i \in Y \cup \{?\}$. In the I/O sequence $y_1x_1y_2x_2\ldots =
y_1c_1z_2v_2y_2c_2z_3v_3\ldots$ the $c_1y_1$ are dummies, after which regular behaviour starts.
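
The interaction protocol of the EX model can be sketched as a small simulation. Everything
concrete in the following Python sketch is an assumption made for illustration: the
three-pair relation `R`, the probability of a query, and the simple `lookup_agent`, which
merely stands in for a learner and is not the AI$\xi$ model.

```python
import random

# Hypothetical relation R on (object, property) pairs, for illustration only.
R = {("tree", "green"), ("tree", "plant"), ("sky", "blue")}
Z_SET = sorted({z for z, _ in R})
QUERY = "?"


def credit(z, y):
    """R(z, y): 1 if the completed pair lies in R, else 0."""
    return 1 if (z, y) in R else 0


def present(rng, p_query=0.5):
    """Draw an input (z, v): a true example from R or a query (z, '?')."""
    if rng.random() < p_query:
        return (rng.choice(Z_SET), QUERY)
    return rng.choice(sorted(R))


def lookup_agent(x, seen):
    """Copy v if it is given, otherwise answer with a property seen earlier for z."""
    z, v = x
    if v != QUERY:
        return v
    for z_seen, v_seen in seen:
        if z_seen == z:
            return v_seen
    return None


def run(agent, cycles, seed=0):
    """Return a list of (input, output, credit) triples, one per cycle."""
    rng = random.Random(seed)
    seen, records = [], []
    for _ in range(cycles):
        x = present(rng)
        y = agent(x, seen)                      # output in the following cycle
        records.append((x, y, credit(x[0], y)))
        if x[1] != QUERY:
            seen.append(x)                      # example information, independent of credit
    return records


records = run(lookup_agent, cycles=50)
```

The examples enter only through `seen`, independently of the credits; the credits are needed
solely to reward the step from a stored example to a correct answer.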

The AI$\mu$ model is optimal by construction of $\mu^{AI}$. For computable prior $\mu_R$ and $\sigma$, we
expect a near optimal behavior of the universal AI$\xi$ model if $\mu_R$ additionally satisfies
some separability property. In the following, we give some motivation why the AI$\xi$ model
takes into account the supervisor information contained in the examples and why it learns
faster than by reinforcement.

We keep $R$ fixed and assume $\mu_R(x'_1 \ldots x'_n) = \mu_R(x'_1) \cdot \ldots \cdot \mu_R(x'_n) \ne 0 \Leftrightarrow x'_i \in R \cup (Z \times \{?\})\ \forall i$ to
simplify the discussion. Short codes $q$ contribute mostly to $\xi^{AI}(y_1x_1 \ldots y_nx_n)$. As $x'_1 \ldots x'_n$
is distributed according to the computable probability distribution $\mu_R$, a short code of
$x'_1 \ldots x'_n$ for large enough $n$ is a Huffman coding w.r.t. the distribution $\mu_R$. So we expect $\mu_R$
and hence $R$ to be coded in the dominant contributions to $\xi^{AI}$ in some way, where the plausible
assumption was made that the $y$ on the input tape do not matter. Much more than one
bit per cycle will usually be learned; hence, relation $R$ can be learned in $n \ll K(R)$ cycles
by appropriate examples. This coding of $R$ in $q$ evolves independently of the feedbacks $c$.
To maximize the feedback $c_k$, the system has to learn to output a $y_{k+1}$ with $(z_k, y_{k+1}) \in R$.
The system has to invent a program extension $q'$ to $q$, which extracts $z_k$ from $x'_k = z_kv_k$
and searches for and outputs a $y_{k+1}$ with $(z_k, y_{k+1}) \in R$. As $R$ is already coded in $q$, $q'$ can
re-use this coding of $R$ in $q$. The size of the extension $q'$ is, therefore, of $O(1)$. To learn
this $q'$, the system requires feedback $c$ with information content of $O(1) = K(q')$ only.

Let us compare this with reinforcement learning, where only $x'_k = (z_k, ?)$ pairs are presented.
A coding of $R$ in a short code $q$ for $x'_1 \ldots x'_n$ is of no use and will therefore be absent.
Only the credits $c$ force the system to learn $R$. $q'$ is therefore expected to be of size $K(R)$.
The information content in the $c$'s must be of the order $K(R)$. In practice, there are
often only very few $c_k = 1$ at the beginning of the learning phase and the information
content in $c_1 \ldots c_n$ is much less than $n$ bits. The required number of cycles to learn $R$ by
reinforcement is, therefore, at least $K(R)$, but in many cases much larger.
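
The information-content argument can be made concrete with a small numerical sketch. It
assumes, purely for illustration, that the credits are independent binary values with a
fixed probability of $c_k = 1$; this independence assumption and the function names are
not part of the model.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a single binary credit with P(c = 1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def credit_information_upper_bound(n_cycles: int, p_reward: float) -> float:
    """Upper bound on the bits carried by n independent binary credits."""
    return n_cycles * binary_entropy(p_reward)


# Rare rewards carry far less than one bit per cycle; balanced rewards carry one bit.
rare = credit_information_upper_bound(1000, 0.01)
balanced = credit_information_upper_bound(1000, 0.5)
```

Under this assumption, a credit stream in which only about one percent of the credits are 1
carries well under a tenth of a bit per cycle, which is why learning $R$ from credits alone
may need many more than $K(R)$ cycles.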

Although AI$\xi$ was never designed or told to learn supervised, it learns how to take advantage
of the examples from the supervisor. $\mu_R$ and $R$ are learned from the examples; the
credits $c$ are not necessary for this process. The remaining task of learning how to learn
supervised is then a simple task of complexity $O(1)$, for which the credits $c$ are necessary.

9 Other AI Classes



Other aspects of intelligence: In AI, a variety of general ideas and methods have
been developed. In the last sections, we have seen how several problem classes can be
formulated within AI$\xi$. As we claim universality of the AI$\xi$ model, we want to shed light on
which of the other AI methods are incorporated in the AI$\xi$ model, and how, by looking
at its structure. Some methods are directly included, others are or should be emergent. We
do not claim the following list to be complete.

Probability theory and utility theory are the heart of the AI$\mu$/$\xi$ models. The probabilities
are the true/universal behaviours of the environment. The utility function is what we
called total credit, which should be maximized. Maximization of an expected utility
function in a probabilistic environment is usually called sequential decision theory, and
is explicitly integrated in full generality in our model. This includes probabilistic (a
generalization of deterministic) reasoning, where the objects of reasoning are not true
or false statements, but predictions of the environmental behaviour. Reinforcement
learning is explicitly built in, due to the credits. Supervised learning is an emergent
phenomenon (section 8). Algorithmic information theory leads us to use $\xi$ as a universal
estimate for the prior probability $\mu$.



For horizon > 1, the alternating series of expectations and maximizations (the expectimax
series) in (16) and the process of selecting maximal values can be interpreted as abstract
planning. This expectimax series also includes informed search, in the case of AI$\mu$, and
heuristic search, for AI$\xi$, where $\xi$ could be interpreted as a heuristic for $\mu$. The minimax
strategy of game playing in the case of AI$\mu$ is also subsumed. The AI$\xi$ model converges to the
minimax strategy if the environment is a minimax player, but it can also take advantage of
environmental players with limited rationality. Problem solving occurs (only) in the form of
how to maximize the expected future credit.
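
As an illustration of the expectimax idea, here is a minimal sketch for a known environment
and a short horizon. The toy environment (`toy_prob`, `toy_credit`), the action and percept
alphabets and all function names are assumptions made for this example; they do not appear in
the paper, and the sketch uses a known conditional probability in place of $\mu$ or $\xi$.

```python
from typing import Callable, Optional, Sequence, Tuple

History = Tuple[Tuple[str, str], ...]  # sequence of (action y, percept x) pairs


def expectimax(
    history: History,
    horizon: int,
    actions: Sequence[str],
    percepts: Sequence[str],
    prob: Callable[[History, str, str], float],
    credit: Callable[[str], float],
) -> Tuple[float, Optional[str]]:
    """Return (maximal expected future credit, maximizing action)."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for y in actions:  # maximization over own outputs
        value = 0.0
        for x in percepts:  # expectation over environmental inputs
            p = prob(history, y, x)
            if p == 0.0:
                continue
            future, _ = expectimax(
                history + ((y, x),), horizon - 1, actions, percepts, prob, credit
            )
            value += p * (credit(x) + future)
        if value > best_value:
            best_value, best_action = value, y
    return best_value, best_action


# Hypothetical toy environment: action "a" is rewarded with probability 0.8,
# action "b" with probability 0.3, independently of the history.
ACTIONS = ("a", "b")
PERCEPTS = ("1", "0")


def toy_prob(history: History, y: str, x: str) -> float:
    p_reward = 0.8 if y == "a" else 0.3
    return p_reward if x == "1" else 1.0 - p_reward


def toy_credit(x: str) -> float:
    return 1.0 if x == "1" else 0.0


value, action = expectimax((), 2, ACTIONS, PERCEPTS, toy_prob, toy_credit)
```

The recursion visits $(|Y| \cdot |X|)^h$ histories for horizon $h$, which is the source of the
exponential factors that appear in the time bounds of the next section.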

Knowledge is accumulated by AI$\xi$ and is stored in some form not specified further on the
working tape. Any kind of information in any representation on the inputs $y$ is exploited.
The problem of knowledge engineering and representation appears in the form of how to
train the AI$\xi$ model. More practical aspects, like language or image processing, have to
be learned by AI$\xi$ from scratch.

Other theories, like fuzzy logic, possibility theory, Dempster-Shafer theory, ... are partly
outdated and partly reducible to Bayesian probability theory [?]. The interpretation and
effects of the evidence gap $g := 1 - \sum_{x_k} \xi(yx_{<k} y\underline{x}_k) > 0$ in $\xi$ may be similar to those in
Dempster-Shafer theory. Boolean logical reasoning about the external world plays, at
best, an emergent role in the AI$\xi$ model.

Other methods, which do not seem to be contained in the AI$\xi$ model, might also be emergent
phenomena. The AI$\xi$ model has to construct short codes of the environmental behaviour;
the AI$\xi^{\tilde t\tilde l}$ model (see next section) has to construct short action programs. If we analyzed
and interpreted these programs for realistic environments, we might find some of the
unmentioned or unused or new AI methods at work in these algorithms. This is, however,
pure speculation at this point. More importantly, when trying to make AI$\xi$ practically
usable, some other AI methods, like genetic algorithms or neural nets, may be useful.

The main thing we wanted to point out is that the AI$\xi$ model does not lack any important
known property of intelligence or known AI methodology. What is missing, however, are
computational aspects, which are addressed in the next section.

10 Time Bounds and Effectiveness

Introduction: Until now, we have not bothered with the non-computability of the
universal probability distribution $\xi$. As all universal models in this paper are based on
$\xi$, they are not effective in this form. In this section, we outline how the previous
models and results can be modified/generalized to the time-bounded case. Indeed, the
situation is not as bad as it could be. $\xi$ and $C$ are enumerable and $\dot y_k$ is still approximable
or computable in the limit. There exists an algorithm that will produce a sequence of
outputs eventually converging to the exact output $\dot y_k$, but we can never be sure whether
we have already reached it. Besides this, the convergence is extremely slow, so this
type of asymptotic computability is of no direct (practical) use, but it will nevertheless be
important later.

Let $\tilde p$ be a program which calculates within a reasonable time $\tilde t$ per cycle a reasonably
intelligent output, i.e. $\tilde p(\dot x_{<k}) = \dot y_{1:k}$. This sort of computability assumption, that a general
purpose computer of sufficient power is able to behave in an intelligent way, is the very
basis of AI, justifying the hope to be able to construct systems which eventually reach
and outperform human intelligence. For a contrary viewpoint see [?]. It is not necessary
to discuss here what is meant by 'reasonable time/intelligence' and 'sufficient power'.
What we are interested in, in this section, is whether there is a computable version AI$\xi^{\tilde t}$
of the AI$\xi$ system which is superior or equal to any $p$ with computation time per cycle
of at most $\tilde t$. By 'superior', we mean 'more intelligent', so what we need is an order
relation (like (39)) for intelligence.



The best result we could think of would be an AI$\xi^{\tilde t}$ with computation time $\le \tilde t$ at least
as intelligent as any $p$ with computation time $\le \tilde t$. If AI is possible at all, we would have
reached the final goal, the construction of the most intelligent algorithm with computation
time $\le \tilde t$. Just as there is no universal measure in the set of computable measures (within
time $\tilde t$), such an AI$\xi^{\tilde t}$ may not exist either.

What we can realistically hope to construct is an AI$\xi^{\tilde t}$ system of computation time $c \cdot \tilde t$ per
cycle for some constant $c$. The idea is to run all programs $p$ of length $\le \tilde l := l(\tilde p)$ and time
$\le \tilde t$ per cycle and pick the best output. The total computation time is $2^{\tilde l} \cdot \tilde t$, hence $c = 2^{\tilde l}$.
This sort of idea of 'typing monkeys', with one of them eventually writing Shakespeare,
has been applied in various forms and contexts in theoretical computer science. The
realization of this best vote idea, in our case, is not straightforward and will be outlined
in this section. A related idea is that of basing the decision on the majority of
algorithms. This 'democratic vote' idea has been used in [?] for sequence prediction,
and is referred to as 'weighted majority' there.



Time limited probability distributions: In the literature one can find time limited
versions of Kolmogorov complexity [?] and the time limited universal semimeasure
[21, ?]. In the following, we utilize and adapt the latter and see how far we get. One
way to define a time-limited universal chronological semimeasure is as a sum over all
enumerable chronological semimeasures computable within time $\tilde t$ and of size at most $\tilde l$,
similar to the unbounded case:

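$$\xi^{\tilde t\tilde l}(y\underline{x}_{1:n}) \;:=\; \sum_{\rho \,:\, l(\rho) \le \tilde l \,\wedge\, t(\rho) \le \tilde t} 2^{-l(\rho)}\, \rho(y\underline{x}_{1:n}) \qquad (63)$$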
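
A toy version of the sum in (63) may help to see its structure. The sketch below is a
simplification made for illustration only: it ignores the actions $y$, uses a hand-picked
finite list of models in place of an enumeration of all bounded semimeasures, and the
names `BoundedModel`, `xi_tl` and the three example models are hypothetical.

```python
from fractions import Fraction
from typing import Callable, List, NamedTuple, Sequence


class BoundedModel(NamedTuple):
    length: int  # assumed description length l(rho) in bits
    time: int    # assumed computation time t(rho)
    prob: Callable[[Sequence[int]], Fraction]  # rho(x_{1:n})


def xi_tl(models: List[BoundedModel], x: Sequence[int], l_max: int, t_max: int) -> Fraction:
    """Sum 2^{-l(rho)} * rho(x) over all models within the length and time bounds."""
    total = Fraction(0)
    for m in models:
        if m.length <= l_max and m.time <= t_max:
            total += Fraction(1, 2 ** m.length) * m.prob(x)
    return total


def uniform(x: Sequence[int]) -> Fraction:
    return Fraction(1, 2 ** len(x))


def all_ones(x: Sequence[int]) -> Fraction:
    return Fraction(int(all(b == 1 for b in x)))


def alternating(x: Sequence[int]) -> Fraction:
    return Fraction(int(all(b == (i + 1) % 2 for i, b in enumerate(x))))


MODELS = [
    BoundedModel(length=1, time=1, prob=uniform),
    BoundedModel(length=2, time=1, prob=all_ones),
    BoundedModel(length=2, time=100, prob=alternating),  # too slow for small t_max
]

fast_only = xi_tl(MODELS, [1, 0, 1], l_max=2, t_max=10)
with_slow = xi_tl(MODELS, [1, 0, 1], l_max=2, t_max=100)
```

Raising the time bound admits the slow model and increases the mixture weight of the
sequence it predicts; each admitted model $\rho$ is dominated with factor $2^{-l(\rho)}$.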

Let us assume that the true environmental prior probability $\mu^{AI}$ is equal to or sufficiently
accurately approximated by a $\rho$ with $l(\rho) \le \tilde l$ and $t(\rho) \le \tilde t$ with $\tilde t$ and $\tilde l$ of reasonable size.
There are several AI problems that fall into this class. In function minimization,
the computation of $f$ and of the corresponding $\mu$ is usually feasible. In many cases, the
sequences to be predicted in sequence prediction can be easily calculated when the
corresponding $\mu$ is known. In a classifier problem, the probability distribution according
to which examples are presented is, in many cases, also elementary. But not all AI problems
are of this 'easy' type. For strategic games, the environment is usually itself a highly
complex strategic player with a $\mu$ that is difficult to calculate,
although one might argue that the environmental player may have limited capabilities
too. But it is easy to think of a difficult to calculate physical (probabilistic) environment
like the chemistry of biomolecules.

The number of interesting applications makes this restricted class of AI problems, with
time and space bounded environment $\mu^{\tilde t\tilde l}$, worth studying. Superscripts to a probability
distribution, except for $\xi^{\tilde t\tilde l}$, indicate their length and maximal computation time.
$\xi^{\tilde t\tilde l}$, defined in (63), with a yet to be determined computation time, multiplicatively
dominates all $\mu^{\tilde t\tilde l}$ of this type. Hence, an AI$\xi^{\tilde t\tilde l}$ model, where we use $\xi^{\tilde t\tilde l}$ as prior probability, is
universal relative to all AI$\rho^{\tilde t\tilde l}$ models in the same way as AI$\xi$ is universal to AI$\rho$ for all
enumerable chronological semimeasures $\rho$. The $\mathrm{maxarg}_{y_k}$ in the definition of the output selects
a $y_k$ for which $\xi^{\tilde t\tilde l}$ has the highest expected utility $C_{km_k}$, where $\xi^{\tilde t\tilde l}$ is the weighted average
over the $\rho^{\tilde t\tilde l}$. $\dot y_k^{AI\xi^{\tilde t\tilde l}}$ is determined by a weighted majority. We expect AI$\xi^{\tilde t\tilde l}$ to outperform all
(bounded) AI$\rho^{\tilde t\tilde l}$, analogous to the unrestricted case.

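In the following we analyze the computability properties of $\xi^{\tilde t\tilde l}$ and AI$\xi^{\tilde t\tilde l}$, i.e. of $\dot y_k^{AI\xi^{\tilde t\tilde l}}$.
To compute $\xi^{\tilde t\tilde l}$ according to the definition (63) we have to enumerate all chronological
enumerable semimeasures $\rho^{\tilde t\tilde l}$ of length $\le \tilde l$ and computation time $\le \tilde t$. This can be done
similarly to the unbounded case. All $2^{\tilde l}$ enumerable functions of length $\le \tilde l$,
computable within time $\tilde t$, have to be converted to chronological probability distributions.
For this, one has to evaluate each function for $|X| \cdot k$ different arguments. Hence, $\xi^{\tilde t\tilde l}$ is
computable within time

$$t\big(\xi^{\tilde t\tilde l}(y\underline{x}_{1:k})\big) = O(|X| \cdot k \cdot 2^{\tilde l} \cdot \tilde t)$$

(we assume that a TM can be simulated by another in linear time). The computation time of
$\dot y_k^{AI\xi^{\tilde t\tilde l}}$ depends on the size of $X$, $Y$ and $m_k$. $\xi^{\tilde t\tilde l}$ has to be evaluated $|Y|^{h_k} |X|^{h_k}$ times in the
expectimax expression for $\dot y_k$. It is possible to optimize the algorithm and perform the
computation within time

$$t\big(\dot y_k^{AI\xi^{\tilde t\tilde l}}\big) = O(|Y|^{h_k} |X|^{h_k} \cdot 2^{\tilde l} \cdot \tilde t) \qquad (64)$$

per cycle. If we assume that the computation time of $\mu^{\tilde t\tilde l}$ is exactly $\tilde t$ for all arguments,
the brute force time $\bar t$ for calculating the sums and maxs in the expression for $\dot y_k^{AI\mu^{\tilde t\tilde l}}$ is

$$\bar t\big(\dot y_k^{AI\mu^{\tilde t\tilde l}}\big) \ge |Y|^{h_k} |X|^{h_k} \cdot \tilde t$$

Combining this with (64), we get

$$t\big(\dot y_k^{AI\xi^{\tilde t\tilde l}}\big) = O\big(2^{\tilde l} \cdot \bar t(\dot y_k^{AI\mu^{\tilde t\tilde l}})\big)$$

This result has the proposed structure, that there is a universal AI$\xi^{\tilde t\tilde l}$ system with
computation time $2^{\tilde l}$ times the computation time of a special AI$\mu^{\tilde t\tilde l}$ system.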
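
To get a feeling for the size of the bound (64), the following sketch evaluates its leading
term for concrete numbers. Constant factors hidden in the $O(\cdot)$ notation are ignored, and
the function names and example values are assumptions chosen for illustration.

```python
def brute_force_time_mu(size_y: int, size_x: int, horizon: int, t: int) -> int:
    """Leading term |Y|^h * |X|^h * t of the brute-force expectimax evaluation."""
    return (size_y * size_x) ** horizon * t


def time_xi_tl(size_y: int, size_x: int, horizon: int, l: int, t: int) -> int:
    """Leading term of (64): 2^l times the brute-force time of a single model."""
    return 2 ** l * brute_force_time_mu(size_y, size_x, horizon, t)


# Binary alphabets, horizon 3, programs of at most 10 bits, 100 steps per evaluation.
single_model = brute_force_time_mu(2, 2, 3, 100)
universal = time_xi_tl(2, 2, 3, 10, 100)
```

Even for these tiny values the universal system needs roughly a thousand times the effort
of a single bounded model, and the factor doubles with every additional bit of program length.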

Unfortunately, the class of AI$\mu^{\tilde t\tilde l}$ systems with brute force evaluation of $\dot y_k$ according to
the expectimax expression is completely uninteresting from a practical point of view. E.g. in
the context of chess, the above result says that the AI$\xi^{\tilde t\tilde l}$ is superior within time $2^{\tilde l} \cdot \tilde t$ to any
brute force minimax strategy of computation time $\tilde t$. Even if the factor of $2^{\tilde l}$ in computation
time did not matter, the AI$\xi^{\tilde t\tilde l}$ system is nevertheless practically useless, as a brute force
minimax chess player with reasonable time $\tilde t$ is a very poor player.

Note that in the case of sequence prediction ($h_k = 1$, $|Y| = |X| = 2$) the computation time
of $\rho$ coincides with that of $\dot y_k^{AI\rho}$ within a factor of 2. The class AI$\rho^{\tilde t\tilde l}$ includes all
non-incremental sequence prediction algorithms of size $\le \tilde l$ and computation time $\le \tilde t/2$. By
non-incremental, we mean that no information of previous cycles is taken into account
for the computation of $\dot y_k$ of the current cycle.

The shortcomings (mentioned and unmentioned ones) of this approach are cured in the
next subsection, by deviating from the standard way of defining a time-bounded $\xi$ as a
sum over functions or programs.

The idea of the best vote algorithm: A general cybernetic or AI system is a chronological
program $p(x_{<k}) = y_{1:k}$. This form, introduced in an earlier section, is general enough to
include any AI system (and also less intelligent systems). In the following, we are interested
in programs $p$ of length $\le \tilde l$ and computation time $\le \tilde t$ per cycle. One important
point in the time-limited setting is that $p$ should be incremental, i.e. when computing
$\dot y_k$ in cycle $k$, the information of the previous cycles stored on the working tape can be
re-used. Indeed, there is probably no practically interesting, non-incremental AI system
at all.

In the following, we construct a policy $p^*$, or more precisely, policies $p_k^*$ for every cycle
$k$ that outperform all time and length limited AI systems $p$. In cycle $k$, $p_k^*$ runs all $2^{\tilde l}$
programs $p$ and selects the one with the best output $y_k$. This is a 'best vote' type of
algorithm, as compared to the 'weighted majority' like algorithm of the last subsection.
The ideal measure for the quality of the output would be the $\xi$ expected credit

$$C_{km_k}(p \,|\, \dot y\dot x_{<k}) \;:=\; \sum_{q \in \dot Q_k} 2^{-l(q)}\, C_{km_k}(p,q) \;, \qquad C_{km_k}(p,q) \;:=\; c(x_k^{pq}) + \ldots + c(x_{m_k}^{pq}) \qquad (65)$$

The program $p$ which maximizes $C_{km_k}$ should be selected. We have dropped the
normalization $\mathcal{N}$, unlike in the original definition, as it is independent of $p$ and does not
change the order relation which we are solely interested in here. Furthermore, without
normalization, $C_{km_k}$ is enumerable, which will be important later.

Extended chronological programs: In the (functional form of the) AI$\xi$ model it was
convenient to maximize $C_{km_k}$ over all $p \in \dot P_k$, i.e. all $p$ consistent with the current history
$\dot y\dot x_{<k}$. This was no restriction, because for every possibly inconsistent program $p$ there
exists a program $p' \in \dot P_k$ consistent with the current history and identical to $p$ for all
future cycles $\ge k$. For the time limited best vote algorithm $p^*$ it would be too restrictive
to demand $p \in \dot P_k$. To prove universality, one has to compare all $2^{\tilde l}$ algorithms in every
cycle, not just the consistent ones. An inconsistent algorithm may become the best one
in later cycles. For inconsistent programs we have to include the $\dot y_k$ into the input, i.e.
$p(\dot y\dot x_{<k}) = y_{1:k}^p$ with $\dot y_i \neq y_i^p$ possible. For $p \in \dot P_k$ this was not necessary, as $p$ knows the
output $\dot y_k \equiv y_k^p$ in this case. The $c(x_i^{pq})$ in the definition of $C_{km_k}$ are the valuations emerging in
the I/O sequence, starting with $\dot y\dot x_{<k}$ (emerging from $p^*$) and then continued by applying
$p$ and $q$ with $\dot y_i := y_i^p$ for $i \ge k$.

Another problem is that we need $C_{km_k}$ to select the best policy, but unfortunately $C_{km_k}$
is uncomputable. Indeed, the structure of the definition of $C_{km_k}$ is very similar to that
of $\dot y_k$, hence a brute force approach to approximate $C_{km_k}$ requires too much computation
time, as for $\dot y_k$. We solve this problem in a similar way, by supplementing each $p$ with a
program that estimates $C_{km_k}$ by $w_k^p$ within time $\tilde t$. We combine the calculation of $y_k^p$ and
$w_k^p$ and extend the notion of a chronological program once again to

$$p(\dot y\dot x_{<k}) = w_1^p y_1^p \ldots w_k^p y_k^p \qquad (66)$$

with chronological order $w_1^p y_1^p \dot y_1 \dot x_1 w_2^p y_2^p \dot y_2 \dot x_2 \ldots$.

Valid approximations: $p$ might suggest any output $y_k^p$ but it is not allowed to rate it
with an arbitrarily high $w_k^p$ if we want $w_k^p$ to be a reliable criterion for selecting the best
$p$. We demand that no policy is allowed to claim that it is better than it actually is. We
define a (logical) predicate VA($p$) called valid approximation, which is true if, and only if,
$p$ always satisfies $w_k^p \le C_{km_k}(p)$, i.e. never overrates itself.

$$\mathrm{VA}(p) \;\equiv\; \forall k\, \forall w_1^p y_1^p \dot y_1 \dot x_1 \ldots w_k^p y_k^p \,:\, p(\dot y\dot x_{<k}) = w_1^p y_1^p \ldots w_k^p y_k^p \;\Rightarrow\; w_k^p \le C_{km_k}(p \,|\, \dot y\dot x_{<k}) \qquad (67)$$

In the following, we restrict our attention to programs $p$ for which VA($p$) can be proved
in some formal axiomatic system. A very important point is that $C_{km_k}$ is enumerable.
This ensures the existence of sequences of programs $p_1, p_2, p_3, \ldots$ for which VA($p_i$) can be
proved and $\lim_{i \to \infty} w_k^{p_i} = C_{km_k}(p)$ for all $k$ and all I/O sequences. The approximation is
not uniform in $k$, but this does not matter as the selected $p$ is allowed to change from
cycle to cycle.

Another possibility would be to consider only those $p$ which check $w_k^p \le C_{km_k}(p)$ online
in every cycle, instead of the pre-check VA($p$), either by constructing a proof (on the
working tape) for this special case, or because it is already evident by the construction of $w_k^p$. In
cases where $p$ cannot guarantee $w_k^p \le C_{km_k}(p)$ it sets $w_k^p = 0$ and, hence, trivially satisfies
$w_k^p \le C_{km_k}(p)$. On the other hand, for these $p$ it is also no problem to prove VA($p$), as one
has simply to analyze the internal structure of $p$ and recognize that $p$ shows the validity
internally itself, cycle by cycle, which is easy by assumption on $p$. The cycle by cycle
check is, therefore, a special case of the pre-proof of VA($p$).

Effective intelligence order relation: In an earlier section we have introduced an intelligence
order relation $\succeq$ on AI systems, based on the expected credit $C_{km_k}(p)$. In the following
we need an order relation $\succeq^c$ based on the claimed credit $w_k^p$, which might be interpreted
as an approximation to $\succeq$. We call $p$ effectively more or equally intelligent than $p'$ if

$$p \succeq^c p' \;:\Leftrightarrow\; \forall k\, \forall \dot y\dot x_{<k}\, \exists w_{1:n} w'_{1:n} \,:\, p(\dot y\dot x_{<k}) = w_1 * \ldots w_k * \;\wedge\; p'(\dot y\dot x_{<k}) = w'_1 * \ldots w'_k * \;\wedge\; w_k \ge w'_k \qquad (68)$$

i.e. if $p$ always claims a higher credit estimate $w$ than $p'$. $\succeq^c$ is a co-enumerable partial
order relation on extended chronological programs. Restricted to valid approximations
it orders the policies w.r.t. the quality of their outputs and their ability to justify their
outputs with high $w_k$.

The universal time bounded AI$\xi^{\tilde t\tilde l}$ system: In the following we describe the algorithm
$p^*$ underlying the universal time bounded AI$\xi^{\tilde t\tilde l}$ system. It is essentially based on
the selection of the best algorithms $p_k^*$ out of the time $\tilde t$ and length $\tilde l$ bounded $p$, for which
there exists a proof of VA($p$) with length $\le l_P$.

1. Create all binary strings of length $l_P$ and interpret each as a coding of a mathematical
proof in the same formal logic system in which VA($\cdot$) has been formulated.
Take those strings which are proofs of VA($p$) for some $p$ and keep the corresponding
programs $p$.

2. Eliminate all $p$ of length $> \tilde l$.

3. Modify all $p$ in the following way: all output $w_k^p y_k^p$ is temporarily written on an
auxiliary tape. If $p$ stops in $\tilde t$ steps the internal 'output' is copied to the output
tape. If $p$ does not stop after $\tilde t$ steps a stop is forced, and $w_k = 0$ and some arbitrary
$y_k$ are written on the output tape. Let $P$ be the set of all those modified programs.

4. Start first cycle: $k := 1$.

5. Run every $p \in P$ on extended input $\dot y\dot x_{<k}$, where all outputs are redirected to some
auxiliary tape: $p(\dot y\dot x_{<k}) = w_1^p y_1^p \ldots w_k^p y_k^p$.

6. Select the program $p$ with highest claimed credit $w_k^p$: $p_k^* := \mathrm{maxarg}_p\, w_k^p$.

7. Write $\dot y_k := y_k^{p_k^*}$ to the output tape.

8. Receive input $\dot x_k$ from the environment.

9. Begin next cycle: $k := k + 1$, goto step 5.
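
Steps 3 and 5 to 7 can be sketched as follows. This is only an illustration under strong
simplifying assumptions: steps 1 and 2 (proof enumeration and the length filter) are not
implemented, so the list of policies is assumed to contain only programs with proved VA($p$)
and admissible length; programs are modelled as Python generators in which each `yield`
stands for one computation step; and all names (`run_bounded`, `best_vote_cycle`, the three
example policies) are hypothetical.

```python
from typing import Callable, Generator, List, Sequence, Tuple

History = List[Tuple[str, str]]        # past (output, input) pairs
Result = Tuple[float, str]             # (claimed credit w, suggested output y)
Policy = Callable[[History], Generator[None, None, Result]]


def run_bounded(policy: Policy, history: History, t_max: int, default_action: str) -> Result:
    """Step 3: run a policy for at most t_max steps; on timeout force w = 0."""
    gen = policy(list(history))
    for _ in range(t_max):
        try:
            next(gen)                  # advance the program by one step
        except StopIteration as stop:
            return stop.value          # program halted in time: accept its (w, y)
    gen.close()                        # forced stop
    return 0.0, default_action


def best_vote_cycle(
    policies: Sequence[Policy], history: History, t_max: int, default_action: str
) -> Result:
    """Steps 5 to 7: run all policies and follow the highest claimed credit."""
    results = [run_bounded(p, history, t_max, default_action) for p in policies]
    return max(results, key=lambda r: r[0])


# Hypothetical example policies with different running times and claims.
def cautious(history: History) -> Generator[None, None, Result]:
    yield
    return 0.3, "a"


def confident(history: History) -> Generator[None, None, Result]:
    yield
    yield
    return 0.7, "b"


def slow_expert(history: History) -> Generator[None, None, Result]:
    for _ in range(10):
        yield
    return 0.99, "c"


POLICIES = [cautious, confident, slow_expert]
tight = best_vote_cycle(POLICIES, [], t_max=5, default_action="a")
generous = best_vote_cycle(POLICIES, [], t_max=20, default_action="a")
```

With the tight time bound the slow policy is cut off and contributes only $w = 0$, so the
output of `confident` is used; with the generous bound `slow_expert` wins the vote. The
selection is only meaningful if the claimed credits are valid approximations in the sense of (67).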

It is easy to see that the following theorem holds. 

Main theorem: Let $p$ be any extended chronological (incremental) program like (66)
of length $l(p) \le \tilde l$ and computation time per cycle $t(p) \le \tilde t$, for which there exists a proof of
VA($p$) defined in (67) of length $\le l_P$. The algorithm $p^*$ constructed in the last subsection,
depending on $\tilde l$, $\tilde t$ and $l_P$ but not on $p$, is effectively more or equally intelligent, according
to $\succeq^c$ defined in (68), than any such $p$. The size of $p^*$ is $l(p^*) = O(\ln(\tilde l \cdot \tilde t \cdot l_P))$, the setup-time
is $t_{setup}(p^*) = O(l_P \cdot 2^{l_P})$, the computation time per cycle is $t_{cycle}(p^*) = O(2^{\tilde l} \cdot \tilde t)$.

Roughly speaking, the theorem says that if there exists a computable solution to some
AI problem at all, the explicitly constructed algorithm $p^*$ is such a solution. Although
this theorem is quite general, there are some limitations and open questions which we
discuss in the following.

Limitations and open questions: 

• Formally, the total computation time of $p^*$ for cycles $1 \ldots k$ increases linearly with
$k$, i.e. is of order $O(k)$ with a coefficient $2^{\tilde l} \cdot \tilde t$. The unreasonably large factor $2^{\tilde l}$ is
a well known drawback in best/democratic vote models and will be taken without
further comments, whereas the factor $\tilde t$ can be assumed to be of reasonable size. If
we don't take the limit $k \to \infty$ but consider reasonable $k$, the practical usefulness of
the time bound on $p^*$ is somewhat limited, due to the additional additive constant
$O(l_P \cdot 2^{l_P})$. It is much larger than $k \cdot 2^{\tilde l} \cdot \tilde t$ as typically $l_P \gg l(\mathrm{VA}(p)) \ge l(p) \equiv \tilde l$.

• $p^*$ is superior only to those $p$ which justify their outputs (by large $w_k^p$). It might be
possible that there are $p$ which produce good outputs within reasonable time, but
take an unreasonably long time to justify their outputs by sufficiently high $w_k^p$.
We do not think that (from a certain complexity level onwards) there are policies
where the process of constructing a good output is completely separated from some
sort of justification process. But this justification might not be translatable (at least
within reasonable time) into a reasonable estimate of $C_{km_k}(p)$.

• The (inconsistent) programs $p$ must be able to continue strategies started by other
policies. It might happen that a policy $p$ steers the environment to a direction for
which it is specialized. A 'foreign' policy might be able to displace $p$ only between
loosely bounded episodes. There is probably no problem for factorizable $\mu$. Think
of a chess game, where it is usually very difficult to continue the game/strategy of
a different player. When the game is over, it is usually advantageous to replace
a player by a better one for the next game. There might also be no problem for
sufficiently separable $\mu$.

• There might be (efficient) valid approximations $p$ for which VA($p$) is true but not
provable, or for which only a very long ($> l_P$) proof exists.

Remarks: 

• The idea of suggesting outputs and justifying them by proving credit bounds implements
one aspect of human thinking. There are several possible reactions to an
input. Each reaction possibly has far reaching consequences. Within a limited time
one tries to estimate the consequences as well as possible. Finally, each reaction is
valued and the best one is selected. What is inferior to human thinking is that the
estimates $w_k^p$ must be rigorously proved and the proofs are constructed by blind extensive
search, and further, that all behaviours $p$ of length $\le \tilde l$ are checked. It is inferior
'only' in the sense of necessary computation time but not in the sense of the quality
of the outputs.

• In practical applications there are often cases with short and slow programs $p_s$
performing some task $T$, e.g. the computation of the digits of $\pi$, for which there
also exist long and quick programs $p_l$. If it is not too difficult to prove that this
long program is equivalent to the short one, then it is possible to prove $K(T) \le l(p_s)$
within time $t(p_l)$. Similarly, the method of proving bounds $w_k$ for $C_{km_k}$ can give
high lower bounds without explicitly executing these short and slow programs, which
mainly contribute to $C_{km_k}$.

• Dovetailing all length and time-limited programs is a well known elementary idea
(typing monkeys). The crucial part which has been developed here is the selection
criterion for the most intelligent system.

• By construction of AI$\xi^{\tilde t\tilde l}$ and due to the enumerability of $C_{km_k}$, ensuring arbitrarily
close approximations of $C_{km_k}$, we expect that the behaviour of AI$\xi^{\tilde t\tilde l}$ converges to the
behaviour of AI$\xi$ in the limit $\tilde t, \tilde l \to \infty$ in a sense.

• Depending on what you know/assume that a program $p$ of size $\tilde l$ and computation
time per cycle $\tilde t$ is able to achieve, the computable AI$\xi^{\tilde t\tilde l}$ model will have the same
capabilities. For the strongest assumption of the existence of a Turing machine
which outperforms human intelligence, the AI$\xi^{\tilde t\tilde l}$ will do so too, within the same time
frame up to a (unfortunately very large) constant factor.

11 Outlook & Discussion

This section contains some discussion of otherwise unmentioned topics and some (more 
personal) remarks. It also serves as an outlook to further research. 

Miscellaneous: 



• In game theory [26] one often wants to model the situation of simultaneous actions,
whereas the AI$\xi$ model has serial I/O. Simultaneity can be simulated by withholding
the current system's output $y_k$ from the environment until $x_k$ has been received
by the system. Formally, this means that $\xi(yx_{<k} y\underline{x}_k)$ is independent of $y_k$. The AI$\xi$
system is already of simultaneous type in an abstract view if the behaviour $p$ is
interpreted as the action. In this sense, AI$\xi$ is the action $p^*$ which maximizes the
utility function (credit), under the assumption that the environment acts according
to $\xi$. The situation is different from game theory, as the environment is not modeled
to be a second 'player' that tries to optimize his own utility, although it might
actually be a rational player (see the earlier discussion of strategic games).

• In various examples we have chosen differently specialized input and output spaces
$X$ and $Y$. It should be clear that, in principle, this is unnecessary, as large enough
spaces $X$ and $Y$, e.g. 2^^ bit, serve every need and can always be Turing reduced to
the specific presentation needed internally by the AI$\xi$ system itself. But it is clear
that using a generic interface, such as camera and monitor, for learning tic-tac-toe,
for example, adds the task of learning vision and drawing.



Outlook: 



• Rigorous proofs for credit bounds are the major theoretical challenge: general
ones as well as tighter bounds for special environments $\mu$. Of special importance are
suitable (and acceptable) conditions on $\mu$, under which $\dot y_k$ and finite credit bounds
exist for infinite $Y$, $X$ and $m_k$.

• A direct implementation of the AI$\xi^{\tilde t\tilde l}$ model is, at best, possible for toy environments
due to the large factor $2^{\tilde l}$ in computation time. But there are other applications
of the AI$\xi$ theory. We have seen in several examples how to integrate problem
classes into the AI$\xi$ model. Conversely, one can downscale the AI$\xi$ model by using
more restricted forms of $\xi$. This could be done in the same way as the theory
of universal induction has been downscaled with many insights to the Minimum
Description Length principle [?] or to the domain of finite automata [?]. The AI$\xi$
model might similarly serve as a super model or as the very definition of (universal
unbiased) intelligence, from which specialized models could be derived.

• With a reasonable computation time, the AI$\xi$ model would be a solution of AI (see
next point if you disagree). The AI$\xi^{\tilde t\tilde l}$ model was the first step, but the elimination of
the factor $2^{\tilde l}$ without giving up universality will (almost certainly) be a very difficult
task. One could try to select programs $p$ and prove VA($p$) in a more clever way than
by mere enumeration, to improve performance without destroying universality. All
kinds of ideas, like genetic algorithms, advanced theorem provers and many more,
could be incorporated. But now we are in trouble. We seem to have transferred
the AI problem just to a different level. This shift has some advantages (and also
some disadvantages) but presents, in no way, a solution. Nevertheless, we want
to stress that we have reduced the AI problem to (mere) computational questions.
Even the most general other systems the author is aware of depend on some (more
than computational) assumptions about the environment, or it is far from clear
whether they are, indeed, universal and optimal. Although computational questions
are themselves highly complicated, this reduction is a non-trivial result. A formal
theory of something, even if not computable, is often a great step toward solving a
problem and also has merits of its own, and AI should not be different (see previous
item).

• Many researchers in AI believe that intelligence is something complicated and cannot
be condensed into a few formulas. It is more a matter of combining enough methods and
much explicit knowledge in the right way. From a theoretical point of view, we
disagree, as the AI$\xi$ model is simple and seems to serve all needs. From a practical
point of view we agree to the following extent. To reduce the computational burden
one should provide special purpose algorithms (methods) from the very beginning,
probably many of them related to reducing the complexity of the input and output
spaces $X$ and $Y$ by appropriate preprocessing methods.

• There is no need to incorporate extra knowledge from the very beginning. It can be
presented in the first few cycles in any format. As long as the algorithm to interpret
the data is of size $O(1)$, the AI$\xi$ system will 'understand' the data after a few cycles.
If the environment $\mu$ is complicated but extra knowledge $z$ makes
$K(\mu|z)$ small, one can show that the corresponding bound in terms of $K(\mu)$ reduces to
$\frac{1}{2} \ln 2 \cdot K(\mu|z)$ when $x_1 \equiv z$, i.e. when $z$ is presented in the first cycle. The special purpose
algorithms could be presented in $x_1$, too, but it would be cheating to say that no special purpose
algorithms had been implemented in AI$\xi$. The boundary between implementation
and training is blurred in the AI$\xi$ model.

• We have not said much about the training process itself, as it is not specific to the
AI$\xi$ model and has been discussed in the literature in various forms and disciplines.
A serious discussion would be out of place. To repeat a truism, it is, of course,
important to present enough knowledge $x'_k$ and evaluate the system output $y_k$ with
$c_k$ in a reasonable way. To maximize the information content in the credit, one
should start with simple tasks and give positive reward $c_k = 1$ to approximately half
of the outputs $y_k$.

The big questions: This subsection is devoted to the big questions of AI in general
and the AI$\xi$ model in particular, with a personal touch.

• There are two possible objections to AI in general and, therefore, also against AI$\xi$
in particular that we want to comment on briefly. Non-computable physics (which is
not too weird) could make Turing computable AI impossible. As at least the world
that is relevant for humans seems mainly to be computable, we do not believe that
it is necessary to integrate non-computable devices into an AI system. The (clever
and nearly convincing) 'Gödel' argument by Penrose that non-computational
physics must exist and is relevant to the brain has (in our opinion convincing)
loopholes.

• A more serious problem is the evolutionary information gathering process. It has
been shown that the 'number of wisdom' $\Omega$ contains a very compact tabulation of
$2^n$ undecidable problems in its very first $n$ binary digits [?]. $\Omega$ is only enumerable
with computation time increasing more rapidly with $n$ than any recursive function.
The enormous computational power of evolution could have developed and coded
something like $\Omega$ into our genes, which significantly guides human reasoning. In
short: Intelligence could be something complicated and evolution toward it from an
even cleverly designed algorithm of size $O(1)$ could be too slow. As evolution has
already taken place, we could add the information from our genes or brain structure
to any/our AI system, but this means that the important part is still missing and
a simple formal definition of AI is principally impossible.

• For what is probably the biggest question, that of consciousness, we want to give a 
physical analogy. Quantum (field) theory is the most accurate and universal physical 
theory ever invented. Although it was developed as early as the 1930s, the big question 
regarding the interpretation of the wave function collapse is still open. Although 
extremely interesting from a philosophical point of view, it is completely irrelevant 
from a practical point of view.* We believe the same to be true for consciousness in 
the field of Artificial Intelligence: philosophically highly interesting but practically 
unimportant. Whether consciousness will be explained some day is another question. 

* In the theory of everything, the collapse might become of 'practical' importance and 
must or will be solved. 
12 Conclusions 



All tasks which require intelligence to be solved can naturally be formulated as a 
maximization of some expected utility in the framework of agents. We gave a functional and 
an iterative formulation of such a decision theoretic agent, which is general enough to 
cover all AI problem classes, as has been demonstrated by several examples. The main 
remaining problem is the unknown prior probability distribution μ^AI of the environment(s). 
Conventional learning algorithms are unsuitable, because they can neither handle large 
(unstructured) state spaces, nor do they converge in the theoretically minimal number 
of cycles, nor can they handle non-stationary environments appropriately. On the other 
hand, the universal semimeasure ξ (18), based on ideas from algorithmic information 
theory, solves the problem of the unknown prior distribution for induction problems. No 
explicit learning procedure is necessary, as ξ automatically converges to μ. We unified the 
theory of universal sequence prediction with the decision theoretic agent by replacing the 
unknown true prior μ^AI by an appropriately generalized universal semimeasure ξ^AI. We 
gave strong arguments that the resulting AIξ model is the most intelligent, parameterless 
and environmental/application independent model possible. We defined an intelligence 
order relation to give a rigorous meaning to this claim. Furthermore, possible 
solutions to the horizon problem have been discussed. We outlined, for a number of problem 
classes in earlier sections, how the AIξ model can solve them. They include sequence 
prediction, strategic games, function minimization and, especially, how AIξ learns to learn 
supervised. The list could easily be extended to other problem classes like classification, 
function inversion and many others. The major drawback of the AIξ model is that it is 
uncomputable, or more precisely, only asymptotically computable, which makes an 
implementation impossible. To overcome this problem, we constructed a modified model 
AIξ^tl, which is still effectively more intelligent than any other time t and space l bounded 
algorithm. The computation time of AIξ^tl is of the order t·2^l. Possible further research has 
been discussed. The main directions could be to prove general and special credit bounds, 
to use AIξ as a super model and explore its relation to other specialized models, and finally 
to improve performance with or without giving up universality. 
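
The statement that no explicit learning procedure is needed, because the mixture ξ 
converges to μ by itself, can be illustrated with a toy sketch. It is not the universal 
semimeasure: as an assumption made purely for illustration, the class of environments 
is a small finite set of Bernoulli sources, and the weights 2^-(i+1) merely stand in for 
the complexity-based weights of the theory. The names mixture_predictions, thetas, prior 
and true_theta are hypothetical. 

```python
import math
import random


def mixture_predictions(bits, thetas, prior):
    """Return the mixture probability that the next bit is 1, before each bit.

    thetas: candidate Bernoulli parameters (the toy environment class).
    prior:  positive weights; they need not sum to 1 (semimeasure-like).
    """
    # Work with log-weights so long sequences do not underflow.
    log_w = [math.log(w) for w in prior]
    preds = []
    for x in bits:
        # Normalise the current weights into a posterior over environments.
        m = max(log_w)
        post = [math.exp(lw - m) for lw in log_w]
        z = sum(post)
        post = [p / z for p in post]
        # Predictive probability of a 1 under the mixture.
        preds.append(sum(p * th for p, th in zip(post, thetas)))
        # Bayes update: multiply each weight by the likelihood of the observed bit.
        log_w = [
            lw + math.log(th if x == 1 else 1.0 - th)
            for lw, th in zip(log_w, thetas)
        ]
    return preds


thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [2.0 ** -(i + 1) for i in range(len(thetas))]  # illustrative weights only
true_theta = 0.7  # the "unknown" true environment, contained in the class

rng = random.Random(0)
bits = [1 if rng.random() < true_theta else 0 for _ in range(2000)]
preds = mixture_predictions(bits, thetas, prior)

print("first prediction:", preds[0])
print("last prediction: ", preds[-1])
```

The only operation performed is the Bayes update of the weights; the predictions approach 
the true parameter because the true environment is in the class and has non-zero weight. 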